Machine Learning for Causal Inference on Observational Data

Author: Hernán E. BORRÉ

A thesis submitted for the degree of Master of Science in Artificial Intelligence

Supervisor: Dr. Spyros Samothrakis

School of Computer Science and Electronic Engineering

University of Essex

August 2018


Declaration of Authorship

I, Hernán E. BORRÉ, declare that this thesis, titled “Machine Learning for Causal Inference on Observational Data”, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed: Hernán Emilio Borré

Date: 28/08/2018


“Thirty years ago we used to ask: Can a computer simulate all processes of logic? The answer was yes, but the question was surely wrong. We should have asked: Can logic simulate all sequences of cause and effect? And the answer would have been no.”

Gregory Bateson, Mind and Nature


UNIVERSITY OF ESSEX

Abstract


School of Computer Science and Electronic Engineering

Master of Science in Artificial Intelligence

Machine Learning for Causal Inference on Observational Data

by Hernán E. BORRÉ

The established scientific way to make claims about cause and effect is to perform a Randomized Controlled Trial (RCT). However, although RCTs are the best way to determine causal effects, performing such rigorous scientific experiments is most often either impossible or unethical. The Average Treatment Effect (ATE) is usually the outcome of an RCT, and this outcome is ideally proof of an effect in the studied population, which hopefully extends to other individuals. In contrast, it is far more common to find observational data, in which the collected data may be heavily unbalanced in its treatment assignments, or the patients' covariates may come from completely different distributions. Nevertheless, the ultimate goal of causal-effect estimation is to find the specific Individual Treatment Effect (ITE) for each patient. Identifying the Individual Treatment Effect has always been an important topic in the field of causality, especially within the machine learning community.

Applications of such predictions arise in medicine, but they extend to financial investment, advertisement placement, recommender systems for retail, the social sciences, and beyond.

Machine learning algorithms able to learn complex non-linear relationships have been used to detect and predict policies in which, given the particular features of an individual (patient), the algorithm determines whether or not to apply the treatment to them.

In this thesis, the ITE is predicted on an unbalanced, semi-synthetic benchmark dataset. Assuming strong ignorability, alternative machine learning techniques that had not been tested in past publications are applied to predict the ITE from observational data. The results obtained are compared with state-of-the-art outcomes; some of the algorithms applied in this work performed similarly to more complex, custom-designed methods.

In addition, a full review of the recent literature on machine learning applied to causal inference has been carried out.


Acknowledgements

First and foremost, I would like to thank my supervisor, Dr. Spyros Samothrakis. He has been an essential pillar throughout the whole process of this dissertation and my career, giving me not only academic and professional support but also encouraging me to make the most of this process.

Second, I would like to thank my parents, who have always listened to my issues, periods of stress, and emotional crises, and cheered me up every time I needed them. Without them, I would not be here writing this dissertation. They made me believe that everything in life is possible if you try hard enough and are an honest person. I am infinitely grateful to them, forever.

Third, to my former university professors from Universidad Tecnológica Nacional, Facultad Regional Buenos Aires: Dr. Oscar Bruno, Dr. Alejandro Prince, and Dra. María Florencia Pollo-Cataneo, for their recommendation letters, their support throughout the year, and their guidance in my professional career, always giving me the best advice I could get.

Fourth, I would like to thank Dr. Uri Shalit, who offered immediate help and advice on this dissertation's topic, and who also helped me with the full IHDP dataset collection and the metrics for benchmark comparisons.

Fifth, to all my classmates (some of them friends now), who spent countless hours with me discussing our shared passion: making the world a better place through Machine Learning and Artificial Intelligence.

Last but not least, I would like to thank the Government of Argentina and the Argentinian Ministry of Education for giving me the chance to come to study at one of the best universities in the world through the BECAR scholarship.


Contents

Declaration of Authorship

Abstract

Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Purpose and Research Question
  1.3 Approach and Methodology
  1.4 Scope and Limitation

2 Background
  2.0.1 Rubin-Neyman Causal Model
  2.0.2 The fundamental problem of causal analysis
  2.0.3 Metrics for Causality
  2.0.4 Assumptions
  2.0.5 Definitions
  2.0.6 Related Work
  2.1 Machine Learning
    2.1.1 Ordinary Least Squares (Linear Regression)
    2.1.2 Ridge Regression
    2.1.3 Support Vector Regressor
    2.1.4 Bayesian Ridge
    2.1.5 Lasso
    2.1.6 Lasso Lars
    2.1.7 ARD Regression
    2.1.8 Passive Aggressive Regressor
    2.1.9 Theil Sen Regressor
    2.1.10 K-Neighbors Regressor
    2.1.11 Logistic Regression

3 Methodology
  3.1 Dataset
  3.2 IHDP dataset
  3.3 Other articles' metrics

4 Experiments
  4.1 Machine learning methods applied to IHDP dataset
  4.2 Other experiments
    4.2.1 Recursive Feature Elimination
    4.2.2 Domain Adaptation Neural Networks
  4.3 Discussion

5 Conclusions
  5.1 Concluding Remarks
  5.2 Future work

Bibliography


List of Tables

4.1 IHDP 10 replications with traditional machine learning algorithms - Within sample
4.2 IHDP 10 replications with traditional machine learning algorithms - Out-of-sample
4.3 IHDP 100 replications - Within sample
4.4 IHDP 100 replications - Out-of-sample
4.5 IHDP 100 replications, already split dataset - Within sample
4.6 IHDP 100 replications, already split dataset - Out-of-sample
4.7 IHDP 100 replications - No scaling - Within sample
4.8 IHDP 1000 replications - No scaling - Out-of-sample
4.9 IHDP 100 replications - Scaled - Within sample
4.10 IHDP 1000 replications - No scaling - Out-of-sample
4.11 IHDP 100 replications, logistic regressions - Within sample
4.12 IHDP 100 replications, logistic regressions - Out-of-sample
4.13 IHDP 100 replications, SVR hyper-parameter tuning - Within sample
4.14 IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample
4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)
4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)
4.17 Domain Adaptation Neural Networks


List of Abbreviations

ML    Machine Learning
SVR   Support Vector Regressor
RL    Reinforcement Learning
NN    Neural Networks
LR    Linear Regression
KNN   K-Nearest Neighbours
RCE   Randomized Controlled Experiment
ITE   Individual Treatment Effect
ATE   Average Treatment Effect
PEHE  Precision in Estimation of Heterogeneous Effects
CATE  Conditional Average Treatment Effect
RCM   Rubin Causal Model


Chapter 1

Introduction

1.1 Motivation

Causality is often confused with correlation, yet correlation does not imply causation. Such mistaken inferences are often called spurious correlations, and they frequently distort the reasoning by which humans make decisions.

Even today, the scientific community has not agreed on a common definition of causality.

The proven scientific way to make claims about cause and effect is to perform what is called a Randomized Controlled Trial (RCT). In an RCT, a statistically representative portion of the population participating in the experiment (trial) is exposed to a treatment (action), which can be either active (applying the treatment) or neutral (control), i.e. giving the patient a placebo or not treating the patient (unit) at all.

This vocabulary is medical because the field in which RCTs are applied most is medical trials. However, medicine is not the only industry in which conclusions can be drawn from a trial: the methodology is widely used in social studies, and it can also be applied to decisions such as buying, selling, or holding a particular stock, or displaying the advertisement that generates more sales than the others in the advertising industry.

Although Randomized Controlled Trials are the best way to detect causal effects, performing such scientifically rigorous experiments is most of the time either impossible or unethical. An example can be seen when trying to determine whether driving under the effects of alcohol affects the driver's skills. Another clear example is determining the causes of smoking in teenagers or young people: performing an RCT would involve taking two groups of non-smoking teenagers, making half of the units smoke for several years (depending on the experiment or the research question), and then determining what smoking caused in the young people after that period. As the reader can infer, there are clear ethical problems associated with performing full RCTs to determine causes and effects.

It is also important to make clear that there are cases in which previously collected data must be analyzed even though no RCT was in place. Such cases are known as observational data. Observational data is defined as information obtained from previously collected situations in which a formal randomized trial method was not applied, but from which it is still important to try to determine causes and effects. This is the case in most organizations today: they have been collecting massive amounts of data over the last decades, but they could not, or did not, establish an RCT process while collecting the metrics. Moreover, sometimes the data collection happens in a non-randomized, uncontrolled setting. For example, a rare disease that affects only a small percentage of the population may happen to appear across a wide range of people, which makes the inference process difficult.

In the past, causal inference methods were a statisticians-only field. However, with recent advances in machine learning algorithms, more computer scientists and machine learning engineers have been trying to infer causes and relationships through traditional and new machine learning techniques.

Machine learning algorithms able to learn complex non-linear relationships have been used to detect and predict policies in which, given the particular features of an individual (patient), the algorithm determines whether or not to apply the treatment (action) to them. This concept is known as Individual Treatment Effect estimation; it can also be referred to as Policy Risk when predicting a binary class, that is, whether or not to apply the treatment, through the discovery of a certain threshold.

It is of interest to be able to predict the individual (customized) treatment effect, because this leads to better decisions (actions or treatments) specifically shaped for each person, rather than relying only on the average over the whole studied population.

The ultimate motivation of this work is to predict Individual Treatment Effects for new patients from previously collected data, using machine learning techniques alternative to the ones used in past research efforts. Moreover, this work attempts to compile a review of the literature that can be understood by computer scientists, and running code for all the experiments performed will be released for others to build upon.

1.2 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE), and the Precision in Estimation of Heterogeneous Effect (PEHE) for a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program): a semi-synthetic dataset that is particularly unbalanced and was created for the task of causal inference on observational data (Hill, 2011).

The research question is whether alternative machine learning methods, without adding extra complexity, custom error losses, or custom metric functions during learning and prediction, can obtain similar or even better results than the state-of-the-art metrics on the exact same benchmark dataset.


1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes framework (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation are used throughout this thesis. This model, also known as the Rubin-Neyman causal model, is an approach to the statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn Python framework (User Guide: contents — scikit-learn 0.19.2 documentation) was used. All of its underlying methods and their default hyper-parameters were used, and the mathematical notation of its documentation is adopted to describe each algorithm's functions and limitations.

The different algorithms were tested on a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill, 2011) as a semi-synthetic dataset based on real features obtained from an observational study (Gross, 1993). Replications of this dataset were created for 10, 100, and 1000 cases, in order to train the machine learning models, predict with them, and obtain the desired metric error results afterwards.

It is important to notice that the testing method and the metrics used to determine the effectiveness of each algorithm differ from the ones normally used to evaluate machine learning algorithms; testing these algorithms within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset is used, the real Individual Treatment Effect is available for computing test metrics. Therefore, the experimental results consist of the performance of each algorithm in terms of the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE), and the Precision in Estimation of Heterogeneous Effect (PEHE). These are the three metrics displayed in the Experiments chapter for each algorithm trained. A detailed explanation of these formulas can be found in Section 2.0.3.

Also note that the machine learning algorithms are trained using just the applied (observed) treatment, the features (covariates in the causal inference literature), and the observed outcome (usually known as Y, or Y factual). After training, a dataset completely unseen during training is used for testing purposes.

The trained algorithm predicts the outcome from the unit's (also known as the patient's) features (covariates), both for the case in which the unit would have taken the treatment and, likewise, for the case in which the unit would have taken the control. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE, and PEHE metrics are calculated. In addition, an average score and its deviation over the 10, 100, and 1000 replications of IHDP were computed to evaluate these errors in bigger simulated scenarios.
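The train-and-predict procedure just described can be sketched as follows. This is an illustrative sketch on synthetic data, not the thesis's actual code: the data-generating process, the choice of Ridge as the regressor, and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.normal(size=(n, d))              # covariates
t = rng.integers(0, 2, size=n)           # observed treatment assignment
y0 = X @ rng.normal(size=d)              # synthetic control outcome
y1 = y0 + 2.0                            # synthetic treated outcome (true effect = 2)
y_factual = np.where(t == 1, y1, y0)     # only the factual outcome is observed

# Train on the covariates, the observed treatment, and the factual outcome only
model = Ridge().fit(np.column_stack([X, t]), y_factual)

# Predict both potential outcomes for every unit: once with t = 1, once with t = 0
yhat1 = model.predict(np.column_stack([X, np.ones(n)]))
yhat0 = model.predict(np.column_stack([X, np.zeros(n)]))

ite_hat = yhat1 - yhat0                  # estimated Individual Treatment Effects
ate_hat = ite_hat.mean()                 # estimated Average Treatment Effect
```

Because the toy outcome is linear with a constant effect, the estimated ATE lands close to the true value of 2; on the real IHDP replications the same two-prediction scheme is applied per replication and the errors are averaged.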

The mathematical notation is kept to a minimum so as not to burden the reader with unnecessary information.


1.4 Scope and Limitation

In this document, the outcomes of the treatment applied to a patient are analyzed only with respect to two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments nor in the developed code, but the code can easily be extended to cover those cases.

A binary treatment is applied, but its outcome value is continuous. This differs from the most commonly discussed case of four possible scenarios, of which usually just two can be observed or measured. All the experiments and the developed code can be applied to discrete outputs, although other machine learning techniques (classification algorithms) may be more suitable for that type of prediction. There are also cases in which the task is to predict whether or not to apply the treatment to an individual (also known as a unit or patient). Prediction in these cases turns into a classification task: a threshold on the confidence of predicting to affirmatively apply the treatment is usually set and validated, through trial and error against several candidate values, to determine which one predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

That setting is closer to real-world scenarios, where data has been observed and, finally, a decision on whether or not to apply the treatment (action) has to be made in order to pursue a desired result.

In this work, the case in which the dataset contains outcomes in binary form, for predicting whether or not to apply the treatment, will not be covered.
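Although that binary-decision setting is out of scope here, the trial-and-error thresholding idea described above can be sketched. Everything below is illustrative: the data, the names, and the simple "policy value" proxy are assumptions, not the Policy Risk metric from the literature.

```python
import numpy as np

pred_ite = np.array([0.9, -0.2, 1.5, 0.1, -1.0])  # predicted individual effects
t = np.array([1, 0, 1, 1, 0])                      # treatments actually assigned
y_f = np.array([3.0, 1.0, 4.0, 0.5, 2.0])          # observed (factual) outcomes

def policy_value(c):
    """Average factual outcome over units whose assigned treatment
    agrees with the policy 'treat if predicted effect > c'."""
    policy = (pred_ite > c).astype(int)
    match = policy == t
    return y_f[match].mean() if match.any() else float("-inf")

# Trial and error over several candidate thresholds, keeping the best one
best_c = max([-0.5, 0.0, 0.5], key=policy_value)
```

The threshold is validated against the factual outcomes only, since counterfactual outcomes are unavailable in real observational data.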


Chapter 2

Background

2.0.1 Rubin-Neyman Causal Model

The Rubin Causal Model (RCM) (Rubin, 2005), also known as the Rubin-Neyman Potential Outcomes framework, is an extended statistical analysis frame for modeling observational data developed by Donald Rubin. He built the framework on top of the original method that Jerzy Neyman developed in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Rubin-Neyman potential outcomes framework consists of units

x_i ∈ X

with an effectively applied treatment

t_i ∈ {0, 1}.

The two possible potential outcomes are defined by

Y_0(x_i), Y_1(x_i) ∈ Y.

For one of the two (the one which actually happened) we can observe the factual outcome:

y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i)

Let (x_1, t_1, y_1^F), …, (x_n, t_n, y_n^F) be a sample from the factual distribution, and consequently let (x_1, 1 − t_1, y_1^CF), …, (x_n, 1 − t_n, y_n^CF) be the counterfactual sample.

Notice that all the factual outcomes y^F are known, whereas the counterfactual outcome y^CF is never known for any unit (except in the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions y^F and y_ft will be used interchangeably to refer to factual (observed) outcomes, while y^CF and y_cft will denote counterfactual outcomes.
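As a minimal numeric illustration of this notation (toy values; both potential outcome vectors are visible only because the data is synthetic), the factual and counterfactual samples are complementary selections driven by the applied treatment:

```python
import numpy as np

t = np.array([1, 0, 1, 0])               # applied treatments t_i
Y0 = np.array([1.0, 2.0, 3.0, 4.0])      # potential outcomes under control
Y1 = np.array([5.0, 6.0, 7.0, 8.0])      # potential outcomes under treatment

y_f = t * Y1 + (1 - t) * Y0              # factual outcomes y^F
y_cf = (1 - t) * Y1 + t * Y0             # counterfactual outcomes y^CF
```

Each unit contributes exactly one value to y_f and the complementary value to y_cf, which is precisely why the counterfactual column is missing in real data.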

6 Chapter 2 Background

2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x assigned either treatment t = 1 or t = 0, it is impossible to observe the counterfactual outcome E[Y_0 | x, t = 1] or E[Y_1 | x, t = 0] (what would have happened, or what the outcome would have been, had the other treatment been given to unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y_0 | x, t = 0] or E[Y_1 | x, t = 1], or in shorter terms Y_0 or Y_1.

In this dissertation, the focus is on the case where the causal graph is simple and known to be of the form (Y_1, Y_0) ← x → t, with no hidden confounders.

This problem can be extended, as can most of the problems and applications discussed under the Rubin-Neyman Potential Outcomes framework in this dissertation, to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome y^CF is even worse in multi-treatment experiments, since the number of missing values that matter for better Individual Treatment Effect estimates grows with the total number of possible treatments, excluding the one applied.

2.0.3 Metrics for Causality

Three well-known metrics from the causality field are reported for each machine learning technique implemented and applied.

The losses that will be reported are:

• εITE: Error of the Individual Treatment Effect, also known as the Conditional Average Treatment Effect (CATE); it measures how well or how badly the treatment performs on one particular patient.

ITE(x) := E[y | X = x, t = 1] − E[y | X = x, t = 0] = E[Y_x1 − Y_x0]

• εATE: Error of the Average Treatment Effect. As its name describes, it represents the effect that the applied treatment (either t = 0 or t = 1) effectively had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, since the patient's unique characteristics as a unit might lead to wrong results or no results at all.

ATE := E[ITE(x)] = E[δ] = E[Y_1 − Y_0], ∀x ∈ X

• εPEHE: the Precision in Estimation of Heterogeneous Effect measures the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that were right for one measure but wrong, or not as accurate, for the other.

PEHE := (1/N) ∑_{i=1}^{N} ((ŷ_{i1} − ŷ_{i0}) − (y_{i1} − y_{i0}))²

where ŷ_{i1} and ŷ_{i0} denote the predicted potential outcomes for unit i, and y_{i1} and y_{i0} the true ones.
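On a semi-synthetic dataset, where the true potential outcomes are known, the εATE and PEHE computations reduce to a few array operations. The arrays and variable names below are made up for illustration; they are not results from this thesis.

```python
import numpy as np

# True potential outcomes (known here only because the data is synthetic)
y1 = np.array([3.0, 5.0, 4.0])
y0 = np.array([1.0, 2.0, 3.0])
# A model's predicted potential outcomes
yhat1 = np.array([2.5, 5.5, 4.0])
yhat0 = np.array([1.0, 2.5, 2.0])

true_ite = y1 - y0                  # per-unit true effects
pred_ite = yhat1 - yhat0            # per-unit predicted effects

eps_ate = abs(pred_ite.mean() - true_ite.mean())   # ATE error
pehe = np.mean((pred_ite - true_ite) ** 2)         # PEHE as defined above
```

Some papers report the square root of PEHE; the definition above follows the formula stated in this section.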


2.0.4 Assumptions

To obtain the results, three important assumptions under the Rubin-Neyman causal framework must be made:

• Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y_0 is the observed (factual) outcome y^F; if instead the applied treatment was t = 1, then y = Y_1 is the available observed (factual) outcome y^F.

• Strong Ignorability: also known as no unmeasured confounders, this assumption can be stated as (Y_1, Y_0) ⫫ t | x and 0 < p(t = 1 | x) < 1, ∀x. It is important to notice that, to assert this assumption, a domain-knowledge expert has to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: this assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 | x) < 1

2.0.5 Definitions

In causal inference from observational data, several terms are used interchangeably, which might confuse the reader.

These terms should be clear before going further into this dissertation.

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied. Also: patient, individual, input. x_i ∈ X

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Also: features (ML). x ∈ X

• treatment: the possible actions that can be applied to a unit; usually binary, but it can be multi-valued under the Rubin-Neyman Potential Outcomes framework. Also: action. t ∈ {0, 1} or t ∈ {0, …, N}

• outcome: the measured result of applying a treatment t to a unit x. Also: observed outcome, result, factual, Y factual. y_f = y^F

• counterfactual: what the result would have been, had the treatment opposite to the one effectively applied been given to the unit. Also: unobserved outcome. y_cft, y_cf, Y^CF

2.0.6 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin, 1978).


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology, and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning it has caught attention no more than a decade ago (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear Causal Discovery with Additive Noise Models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference for counterfactual prediction is usually grouped into parametric, non-parametric, and doubly robust methods.

Parametric methods model the relationships within feature-action pairs and rewards by means of one or more parameters, trying to explicitly model the relations between context, outcomes, and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015), and regression trees (Chipman, George and McCulloch, 2010) have been used in the past for this task. For example, (Wager and Athey, 2017) estimate ITEs with causal forests, but their asymptotic estimates on datasets with a large number of relevant features have limitations that need to be addressed in future work.

In non-parametric approaches, counterfactual predictions are mostly obtained through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causality is performed by merging parametric and non-parametric methods (Dudik, Langford and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both families; a common example is propensity-score-weighted regression (Bang and Robins, 2005; Dudik, Langford and Li, 2011). When the treatment assignment probability is known, these methods model the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational data settings their efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions, as well as the application of other techniques to causality, with special focus on datasets with unbalanced treatment assignment. This refers to the sub-area known as causal inference from observational data, where observational data is data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance-learning metrics and custom loss functions applied to neural networks has brought interesting advances to the scientific community (Shalit, Johansson and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag, 2016a; Shalit, Johansson and Sontag, 2017; Alaa, Weisz and Van Der Schaar, 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few: (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016), and (Shalit, Johansson and Sontag, 2017). (Johansson, Shalit and Sontag, 2016a) and (Shalit, Johansson and Sontag, 2017) worked on learning balanced representations, using neural networks both to learn better predictions of the factual outcome and to minimize the error loss between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag, 2017) the authors built on (Johansson, Shalit and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work implemented Gaussian processes (Alaa, Weisz and Van Der Schaar, 2017) and decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with many features by performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning the features relevant for predicting some actions while not taking them into account for others. The relevant-feature-selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), similar to (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and did not identify individually important features for the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims, 2015a) came up with a Counterfactual Risk Minimization (CRM) method, in which they seek to minimize the Inverse Propensity Score of the units by introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald and Pereira, 2006). Additional techniques for policy optimization were developed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection sampling. The algorithm that (Atan, Zame and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work in the DA field was done by (Zhang et al., 2013; Daumé, 2009).


To conclude, in cause-and-effect analysis, time series data is widely adopted for decision-making support. The main challenge in the continuous-time setting is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time-series causality task. Later on, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that is set beforehand (Dudik, Langford, and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas, and Brunskill, 2017).

2.1 Machine Learning

In this section, the machine learning techniques applied to perform the experiments, using the scikit-learn open-source framework, are described.

The vast majority of the methods tested belong to the family of Generalized Linear Models: they represent the target (label) value as a linear combination of the covariates (inputs),

y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p        (2.1)

where the vector w = (w_1, ..., w_p) represents the coefficients and w_0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model, the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it.

Mathematically, it solves the problem of:

min_w ||Xw - y||_2^2

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the model exhibits high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.
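As a minimal sketch of the method on toy data (not the IHDP dataset), scikit-learn's `LinearRegression` recovers the coefficients and intercept of Equation 2.1 exactly when the data is noiseless:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy covariates and a linear outcome with known coefficients and intercept.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.7  # w = (1, -2, 0.5), w0 = 0.7

# Ordinary Least Squares: minimizes ||Xw - y||_2^2 over w and w0.
model = LinearRegression().fit(X, y)
# With noiseless data, coef_ and intercept_ match w and w0 up to float error.
```

With approximately collinear columns, the fitted `coef_` would instead be unstable, which is the limitation described above.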

21 Machine Learning 11

2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. The loss turns into a problem of minimizing the penalized sum of squares:

min_w ||Xw - y||_2^2 + α ||w||_2^2

It is worth mentioning that the parameter α ≥ 0 controls the amount of robustness to collinearity that the trained model will have.
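A small sketch of this effect, on hypothetical near-collinear data: OLS coefficients blow up, while the α penalty keeps the Ridge coefficients small:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
# Two nearly identical covariates: an approximate linear dependence.
x1 = rng.rand(100)
X = np.column_stack([x1, x1 + 1e-6 * rng.randn(100)])
y = x1 + 0.01 * rng.randn(100)

ols = LinearRegression().fit(X, y)       # ill-conditioned: huge coefficients
ridge = Ridge(alpha=1.0).fit(X, y)       # penalized: coefficients stay small
```

Larger α shrinks the coefficients further, trading a little bias for much lower variance.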

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification to regression problems. During the training phase, the best possible solution is the one least penalized in total by a loss function. The support vectors are the inputs that are either misclassified, classified within the margin, or lie on the edge of the generated hyperplane used for future predictions.

In particular, an SVR takes the training vectors x_i ∈ R^p, i = 1, ..., n, and a vector y ∈ R^n. ε-SVR solves the following primal problem:

min_{w, b, ζ, ζ*}  (1/2) w^T w + C Σ_{i=1}^{n} (ζ_i + ζ_i*)

where C > 0 is the upper bound and ζ_i, ζ_i* are slack variables. In the dual formulation, e is the vector of all ones and Q is an n-by-n positive semidefinite matrix with Q_ij ≡ K(x_i, x_j) = φ(x_i)^T φ(x_j) the kernel. Here, training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is:

Σ_{i=1}^{n} (α_i - α_i*) K(x_i, x) + ρ
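A minimal sketch on toy one-dimensional data (not IHDP); the `C` and `gamma` values mirror those swept later in the experiments, and the tolerance below reflects the default ε = 0.1 tube:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()  # smooth target, no noise

# RBF-kernel epsilon-SVR; default epsilon = 0.1 defines the insensitive tube.
svr = SVR(kernel="rbf", C=1e3, gamma=0.1).fit(X, y)
pred = svr.predict([[0.5]])  # close to sin(0.5) within roughly epsilon
```

The support vectors (`svr.support_vectors_`) are exactly the training points lying on or outside the ε-tube.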

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than Linear Regression.

This technique elaborates a probabilistic model of the regression problem, with the parameter w of the general Bayesian Regression solver given a spherical Gaussian prior:

p(w|λ) = N(w | 0, λ^-1 I_p)

The scikit-learn defaults are used to train the model: α_1 = α_2 = λ_1 = λ_2 = 10^-6.

During the model-fitting process, the parameters w, α, and λ are estimated jointly.
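A short sketch on hypothetical data: with the default hyper-priors above, `BayesianRidge` returns a predictive mean together with an uncertainty estimate, which plain Ridge does not provide:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.randn(100)

# Defaults: alpha_1 = alpha_2 = lambda_1 = lambda_2 = 1e-6; w, alpha, lambda
# are estimated jointly during fit.
br = BayesianRidge().fit(X, y)
mean, std = br.predict(X[:1], return_std=True)  # predictive mean and std
```

The learned noise precision is available as `br.alpha_` and the weights precision as `br.lambda_`.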


2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

min_w  (1 / (2 n_samples)) ||Xw - y||_2^2 + α ||w||_1

This method minimizes the least-squares penalty with α||w||_1 added, where α is a constant and ||w||_1 is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm yields sparse models, which may be helpful for performing feature selection.
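The sparsity-inducing behaviour can be sketched on synthetic data (hypothetical, for illustration): only two of ten covariates drive the outcome, and the ℓ1 penalty zeroes out most of the irrelevant coefficients:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
# Only covariates 0 and 1 actually drive the outcome.
X = rng.randn(200, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.1).fit(X, y)
# Non-zero coefficients act as an implicit feature-selection mask.
selected = np.flatnonzero(lasso.coef_)
```

Increasing α shrinks more coefficients exactly to zero, giving a sparser model.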

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with L1 regularization applied.

The objective function is determined by

(1 / (2 n_samples)) ||y - Xw||_2^2 + α ||w||_1

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w. It drops the assumption of the Gaussian being spherical, making it elliptical instead.

Mathematically:

p(w|λ) = N(w | 0, A^-1)

with diag(A) = λ = {λ_1, ..., λ_p}.

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, this method does not require a learning rate, but it does require a regularization parameter C. It can be used with two different loss functions: PA-I, also called epsilon-insensitive, or PA-II, also known as squared epsilon-insensitive.

2.1.9 Theil-Sen Regressor

It is especially suited to data with multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems; in that regime the method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm, the target is predicted from the k nearest neighbors found during the training phase. It is important to notice that k is defined by the user and will affect, positively or negatively, the quality of the predictions.
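The effect of the user-chosen neighbour count can be sketched on toy data: a small k memorises the training set (high variance), while a large k over-smooths (high bias):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(100, 1)), axis=0)
y = np.sin(X).ravel()

# Training R^2 for three neighbour counts: k = 1 reproduces the training
# targets exactly, k = 50 averages half the dataset and smooths the sine away.
scores = {k: KNeighborsRegressor(n_neighbors=k).fit(X, y).score(X, y)
          for k in (1, 5, 50)}
```

In practice k is tuned on held-out data rather than training scores.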

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can predict more than two classes using the multinomial (log-linear) formulation.

The scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets; the results are discussed in the Experiments chapter.
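A short sketch on synthetic data of the two l2-penalized solvers compared later (newton-cg and lbfgs); since both optimise the same convex objective, their fits should essentially coincide:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical 3-class problem standing in for the encoded targets.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

clf_newton = LogisticRegression(penalty="l2", solver="newton-cg",
                                max_iter=1000).fit(X, y)
clf_lbfgs = LogisticRegression(penalty="l2", solver="lbfgs",
                               max_iter=1000).fit(X, y)
# Same objective, different optimisers: accuracies agree up to convergence.
acc_newton = clf_newton.score(X, y)
acc_lbfgs = clf_lbfgs.score(X, y)
```

This matches the identical rows seen for the two solvers in the logistic regression tables of Chapter 4.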


Chapter 3

Methodology

This chapter details what this work set out to achieve. In addition, the methods that were used are explained, together with any other information necessary to help the reader follow the experiments covered later.

It also describes how the dataset was used, closing with a section about other possible datasets and the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in full detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project (and, to some extent, of recent efforts in machine learning applied to causality) is to make accurate predictions on a set of units, patients, or inputs (in machine learning vocabulary) that was collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Since the nature of the collected observational data has not been properly randomized, the treated and control groups need not come from the same probability distribution. Also, the number of units that received the treatment and the number that did not could differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they are unethical or impossible to perform. For example, an experiment testing whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, would be completely unethical to run, for clear reasons.

To overcome these limitations when working on causal effects in observational data, synthetic, semi-synthetic, or toy datasets are created by researchers to establish a good starting point and a benchmark framework on which to try, test, or develop better algorithms able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions for causal inference. The most common one is where the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases, a policy risk function π is designed to apply or not apply the treatment t depending on a certain threshold θ. Minimizing the errors made when deciding between the active treatment and control is the main goal when iterating over different values of the threshold for the trained dataset.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment conditions to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits, integration at a dedicated child development center, and a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

(Hill, 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the above paragraph. In (Hill, 2011), some continuous and binary covariates from this real-life RCT were selected; making use of these covariates, the author created simulated, non-parametrically generated outcomes for the whole population of the trial. In total, 25 covariates of the whole study were taken for the dataset creation. The author then introduced an artificial imbalance between control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units, or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 and t = 1, given the generalization task that an algorithm must perform.

Along with the covariates, the simulated causal information can be observed for each unit: the treatment effectively applied (t = 0 or t = 1), the factual outcome (y_F), the counterfactual outcome (y_CF), and the noiseless average outcomes mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyper-parameter selection, all with the log-linear response surface implemented as setting B in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.
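A hedged loading sketch for those replication files. The key names (`x`, `t`, `yf`, `ycf`, `mu0`, `mu1`) and the replication-on-the-last-axis layout are assumptions about the downloaded `.npz` archives and should be verified locally; a synthetic stand-in with the same layout is used below so the snippet is self-contained:

```python
import io
import numpy as np

def load_ihdp_replication(npz, rep):
    """Return one IHDP replication as a dict of per-unit arrays.

    Assumes arrays keyed 'x', 't', 'yf', 'ycf', 'mu0', 'mu1' with the
    replication index on the last axis (verify against the actual files).
    """
    data = np.load(npz)
    return {k: data[k][..., rep] for k in ("x", "t", "yf", "ycf", "mu0", "mu1")}

# Synthetic stand-in: 747 units, 25 covariates, 10 replications.
buf = io.BytesIO()
np.savez(buf, x=np.zeros((747, 25, 10)), t=np.zeros((747, 10)),
         yf=np.zeros((747, 10)), ycf=np.zeros((747, 10)),
         mu0=np.zeros((747, 10)), mu1=np.zeros((747, 10)))
buf.seek(0)
rep0 = load_ihdp_replication(buf, 0)
```

Each replication then supplies the covariates and treatment for training and the held-out causal quantities for evaluation only.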

This dataset is nowadays a strong benchmark framework for analysing the prediction results of new machine learning techniques applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE, and PEHE errors, for the reader to look into further if desired. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b), the IHDP dataset (Hill, 2011) is run on 100 replications for hyper-parameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation, since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). To implement the BART results, (Johansson, Shalit, and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill, 2011). However, the authors neither state the number of replications used to gather the metrics nor specify whether log-linear setting A, B, or any other method was used to simulate the semi-synthetic dataset. Consequently, the results obtained by them cannot be compared with this dissertation's results and are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset was performed to ultimately predict the factual y_F and counterfactual y_CF outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE, and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome y_CF nor the noisy average treatment outcomes mu0, mu1 can be used at all to train the regressor models. Instead, these three values, along with the factual outcome y_F, are only used to compute the εITE, εATE, and εPEHE errors.

The experiments were run on 10, 100, and 1000 replications, both within-sample and out-of-sample. The 10, 100, and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with state-of-the-art errors are also displayed in this chapter so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson, and Sontag, 2017), the same methodology later followed by (Louizos et al., 2017) to perform, compare, and show their results.

Within-sample: these are the errors (ITE, ATE, and PEHE) made by the predictions of the trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples for the treatments t = 0 and t = 1), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data, and they were mentioned in the Background chapter.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE, and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0; subsequently, predictions are made setting all values of the treatment to t = 1. The subtraction of these two predictions for each input is known as the ITE, and it ultimately defines whether the patient would benefit from applying the treatment. Mathematically, it is represented by E[Y_1 - Y_0 | x].

The machine learning algorithms implemented in Python code with (User guide: contents - scikit-learn 0.19.2 documentation) were run with the default hyper-parameters to obtain the above-mentioned metrics, which are finally shown in the tables displayed in this chapter. Hyper-parameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
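The per-unit procedure above, and the evaluation against the noiseless outcomes, can be sketched as follows. This is a hedged reconstruction of the commonly used IHDP metric definitions, not the exact evaluation code of (Louizos et al., 2017), which may differ in details:

```python
import numpy as np

def causal_errors(y0_hat, y1_hat, mu0, mu1):
    """ATE and PEHE errors from per-unit predictions under t=0 and t=1."""
    tau_hat = y1_hat - y0_hat          # estimated ITE: E[Y1 - Y0 | x]
    tau = mu1 - mu0                    # "true" noiseless treatment effect
    eps_ate = abs(tau_hat.mean() - tau.mean())
    eps_pehe = np.sqrt(np.mean((tau_hat - tau) ** 2))
    return eps_ate, eps_pehe

# Sanity check: a model that recovers the effect exactly has zero error.
mu0, mu1 = np.zeros(5), np.ones(5)
ate, pehe = causal_errors(np.zeros(5), np.ones(5), mu0, mu1)
```

In the experiments, `y0_hat` and `y1_hat` come from the same fitted regressor queried with the treatment covariate forced to 0 and then to 1.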

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case, it is remarkable that the split between training and testing was performed randomly over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, any hyper-parameter tuning was performed on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with log-linear response setting A generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four different tables are shown for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler from the scikit-learn library. The results improved, not significantly but enough to keep the scaling for the final results of the methods presented in the following section.
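The scaling step can be sketched as a preprocessing stage in front of the best-performing regressor (toy data below; chaining scaler and SVR in a pipeline is one reasonable way to do it, not necessarily the exact setup used):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(100, 25) * np.arange(1, 26)   # 25 covariates on different scales
y = X[:, 0] + 0.1 * rng.randn(100)

# MinMaxScaler maps each covariate to [0, 1] before the SVR sees it.
model = make_pipeline(MinMaxScaler(), SVR(kernel="rbf", C=1e3, gamma=0.01))
model.fit(X, y)
scaled = model.named_steps["minmaxscaler"].transform(X)
```

Using a pipeline guarantees the scaler is fitted on the training data only, which matters for the out-of-sample runs.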



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is well below the regressors, the main reason being that when encoding the target values to assign them a probability, these are not the same values that need to be predicted; additionally, precision is lost when decoding the predictions. The l2 norm was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

                                       εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)    7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)        7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                       εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)    5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)        5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


Across all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were performed. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were: Radial Basis Function (rbf) kernel, C = 1e3, and gamma = 0.01. The selection was performed within-sample and out-of-sample on 100 replications of the dataset, which is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyper-parameter selection runs.

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

                         εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05        2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01        2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001       3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001      4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001     4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1        3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1        3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1        3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2     2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1     2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4     2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2    2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

                         εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05        2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01        2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001       3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001      4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001     4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1        2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1        2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1        2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2     2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1     2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4     2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2    3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39
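The sweep behind these tables can be sketched as a small grid over C and gamma (toy data below; the thesis scores candidates by the causal error metrics over the 100 replications rather than by plain validation R², and also sweeps larger C values and polynomial kernels):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1])
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate gammas from the grid in Tables 4.13/4.14, fixed C = 1e3.
gammas = (0.1, 0.05, 0.01, 0.001)
scores = {g: SVR(kernel="rbf", C=1e3, gamma=g).fit(X_tr, y_tr).score(X_val, y_val)
          for g in gammas}
best_gamma = max(scores, key=scores.get)
```

The same pattern extends to a full (kernel, C, gamma) grid, or to `GridSearchCV` when cross-validated selection is acceptable.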

Finally, the results obtained by this thesis's experiments are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE       εATE
OLS/LR-1    5.8 ± .3     .73 ± .04
OLS/LR-2    2.4 ± .1     .14 ± .01
BLR         5.8 ± .3     .72 ± .04
k-NN        2.1 ± .1     .14 ± .01
TMLE        5.0 ± .2     .30 ± .01
BART        2.1 ± .1     .23 ± .01
RANDFOR     4.2 ± .2     .73 ± .05
CAUSFOR     3.8 ± .2     .18 ± .01
BNN         2.2 ± .1     .37 ± .03
TARNET      .88 ± .0     .26 ± .01
CFR MMD     .73 ± .0     .30 ± .01
CFR WASS    .71 ± .0     .25 ± .01

Within sample IHDP 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE       εATE
OLS/LR-1    5.8 ± .3     .94 ± .06
OLS/LR-2    2.5 ± .1     .31 ± .02
BLR         5.8 ± .3     .93 ± .05
k-NN        4.1 ± .2     .79 ± .05
BART        2.3 ± .1     .34 ± .02
RANDFOR     6.6 ± .3     .96 ± .06
CAUSFOR     3.8 ± .2     .40 ± .03
BNN         2.1 ± .1     .42 ± .03
CFR MMD     .78 ± .0     .31 ± .01
TARNET      .95 ± .0     .28 ± .01
CFR WASS    .76 ± .0     .27 ± .01

Out-of-sample IHDP 1000 replications
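The two headline quantities compared throughout these tables can be computed directly when, as in the semi-synthetic IHDP data, the noiseless potential outcomes are known. A minimal sketch (the function names are mine, not from the compared papers):

```python
# εATE: absolute error on the Average Treatment Effect.
# √εPEHE: root of the Precision in Estimation of Heterogeneous Effect,
# i.e. the RMSE of the predicted individual treatment effects.
import numpy as np

def eps_ate(mu0, mu1, pred0, pred1):
    """Absolute error between the true and estimated average effects."""
    return float(np.abs((mu1 - mu0).mean() - (pred1 - pred0).mean()))

def sqrt_eps_pehe(mu0, mu1, pred0, pred1):
    """Root-mean-squared error on the individual effects mu1 - mu0."""
    return float(np.sqrt((((mu1 - mu0) - (pred1 - pred0)) ** 2).mean()))
```

A predictor that recovers both potential outcomes exactly scores zero on both metrics, which is why smaller values are better everywhere in this chapter.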


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section presented publications implementing powerful feature-selection methods, an additional experiment was performed in the developed code: machine learning with Recursive Feature Elimination (RFE), using the scikit-learn library.

In addition, under the Strong Ignorability assumption made for the studied dataset it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, eliminating features might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work, so they are not shown here; the reader can reproduce them from the code implementation for further analysis.
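For completeness, the RFE step can be sketched with scikit-learn as below; the base estimator and the number of retained features are illustrative assumptions, not the exact settings used in the released code.

```python
# Hypothetical RFE setup: rank the 25 covariates with a linear model, keep the
# 10 most informative ones, then train any regressor on the reduced matrix.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))                 # stand-in for the IHDP covariates
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only feature 0 is informative

selector = RFE(LinearRegression(), n_features_to_select=10).fit(X, y)
X_reduced = X[:, selector.support_]            # columns surviving elimination
```

RFE repeatedly drops the lowest-ranked feature and refits, which is why it can help when covariates are highly correlated, but also why it conflicts with assuming all confounders must stay in the model.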

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for these 10 replications; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel dual-core i7.

Domain adaptation algorithms are a promising avenue for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

                        εITE           εATE           √εPEHE
DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and Out-of-sample IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the results of the machine learning regression algorithms applied in this dissertation are very close to those obtained in the work published by the cited and compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the unbalanced treatment assignment) nor any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.
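To make the comparison concrete, an Integral Probability Metric of the kind used by the compared CFR-MMD method measures the distance between the treated and control feature distributions. A minimal linear-kernel sketch (shown only for illustration; nothing like it is used in this work's experiments):

```python
# Linear-kernel (squared) Maximum Mean Discrepancy: with a linear kernel the
# MMD reduces to the Euclidean distance between the two group means.
import numpy as np

def linear_mmd2(x_treated, x_control):
    """Squared distance between the mean treated and mean control covariates."""
    diff = x_treated.mean(axis=0) - x_control.mean(axis=0)
    return float(diff @ diff)
```

Methods such as CFR add a penalty of this form to the training loss so that the learned representation of treated and control units overlaps; the point of the discussion above is that the plain regressors performed comparably without it.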

28 Chapter 4 Experiments

It seems that a considerable amount of effort by those authors leads to complicated methods that do not gain much in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the 10 replications of the Domain Adaptation Neural Network training and testing showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate custom techniques that constitute the state of the art.
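The out-of-the-box recipe referred to here amounts to fitting one regressor per treatment arm and differencing the two predictions. A sketch with simulated data (the real experiments use the IHDP covariates and scikit-learn defaults):

```python
# Fit an outcome model on treated units and another on control units, then
# estimate each unit's individual treatment effect as the prediction gap.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 25))
t = rng.random(500) < 0.5                      # binary treatment assignment
y = X[:, 0] + 1.5 * t + rng.normal(scale=0.1, size=500)  # true effect = 1.5

m1 = Ridge().fit(X[t], y[t])                   # factual model, treated arm
m0 = Ridge().fit(X[~t], y[~t])                 # factual model, control arm
ite_hat = m1.predict(X) - m0.predict(X)        # estimated individual effects
ate_hat = float(ite_hat.mean())                # estimated average effect
```

Any of the regressors listed in Chapter 2 can be substituted for `Ridge` here; the thesis' contribution is precisely that this simple two-model scheme already approaches the custom methods' numbers.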

It has to be taken into account that no custom metric functions nor special preprocessing steps (apart from scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, although it shifted once the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in the last few years, such as Domain Adaptation Neural Networks and other methods from the deep learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal problems from observational data, involving more than two possible outcomes, that are substantially better suited to Reinforcement Learning algorithms than to any deep neural network or regressor.

Finally, this work is intended to fill a considerable gap: straightforward definitions for applying machine learning to causality. Although several notable papers have been published in the last two years, they are difficult to follow for researchers with a computer or data science background when relating terms from the causal inference field. I gave my best effort to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, extending the functionality developed for a binary treatment and a continuous outcome to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the full 1000-replication IHDP dataset is a very promising task, due both to the architectural design of the algorithm and to the near state-of-the-art precision it achieved in the experiment run here.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied falls within continuous-space time-series problems. These kinds of datasets will possibly be the next focus of researchers applying machine learning to treatments applied over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur, et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353-60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399-424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". In: Biometrics. DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai, et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094

Bottou, Léon, et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207-3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor, et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266-298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang, et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849-866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, p. 625. ISBN: 9780521686891

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217-240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596-1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O., et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199-1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803

Johansson, Fredrik D. (2017 (accessed July 19, 2018)). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". In: (). URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

- (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

- (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523-539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331-339. DOI: 10.1016/B978-1-55860-377-6.50048-7

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464-1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571

Louizos, Christos, et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2

Maathuis, Marloes H., et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247-248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.nature.com/articles/nmeth0410-247

Mooij, Joris M., et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1-102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991

Nahum-Shani, Inbal, et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478-494. ISSN: 1939-1463. DOI: 10.1037/a0029373

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref

Paduraru, Cosmin, et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89-101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393-1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1-17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41-55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34-58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322-331. DOI: 10.1198/016214504000001880. URL: https://doi.org/10.1198/016214504000001880

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". In: URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (Complete Draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html

- (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu, et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517-1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147-2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018)

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342

- (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352

Zhang, Kun, et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.

Page 2: Machine Learning for causal Inference on Observational Datarepository.essex.ac.uk/24772/1/HernanBorre_Master_Thesis_2018v3.… · School of Computer Science and Electronic Engineering

iii

Declaration of AuthorshipI Hernaacuten E BORREacute declare that this thesis titled ldquoMachine Learning for causalInference on Observational Datardquo and the work presented in it are my own I confirmthat

bull This work was done wholly or mainly while in candidature for a research de-gree at this University

bull Where any part of this thesis has previously been submitted for a degree orany other qualification at this University or any other institution this has beenclearly stated

bull Where I have consulted the published work of others this is always clearlyattributed

bull Where I have quoted from the work of others the source is always given Withthe exception of such quotations this thesis is entirely my own work

bull I have acknowledged all main sources of help

bull Where the thesis is based on work done by myself jointly with others I havemade clear exactly what was done by others and what I have contributed my-self

SignedHernaacuten Emilio Borreacute

Date28082018

v

ldquoThirty years ago we used to ask Can a computer simulate all processes of logic The answerwas yes but the question was surely wrong We should have asked Can logic simulate allsequences of cause and effect And the answer would have been no rdquo

Gregory Bateson Mind and Nature

vii

UNIVERSITY OF ESSEX

Abstract

Faculty Name

School of Computer Science and Electronic Engineering

Master of Science in Artificial Intelligence

Machine Learning for causal Inference on Observational Data

by Hernaacuten E BORREacute

The established scientific way to make claims about cause and effect is to performa Randomized Controlled Trial (RCT) However although RCTs are the best way todetermine causal effects the chances to perform such rigorous scientific experimentsis most often either impossible or unethical The Average Treatment Effect (ATE) isusually the outcome of the RCT experiments and this outcome is ideally proof of aneffect under the studied population which hopefully extends to other individualsIn contrast it is most common to find Observational Data in which the data that hasbeen collected might be heavily unbalanced for treatment assignments or the pa-tients covariates might come from completely different distributions Neverthelessthe ultimate goal of causal effects is to find the specific Individual Treatment Effect(ITE) for each patient Identifying the Individual Treatment Effect is a topic thathas always been important in the field of causality especially within the machinelearning community

Applications of such predictions are related with medicine but can be extensivelyused in financial investments advertisement placements recommender systems forretail and social sciences and beyond

The ability to learn complex non-linear relationships of some machine learning algo-rithms have been trying to detect and predict policies in which given the particularfeatures of an individual (patient) the algorithms could determine whether or not toapply the treatment to them

In this thesis the ITE will be predicted using a benchmark semi synthetic-datasetwhich has been unbalanced Assuming strong ignorability alternative machine learn-ing techniques that had not been tested in past publications will be applied to predictthe ITE from observational data The results obtained are compared with state-of-the-art outcomes some of the algorithms applied in this work performed similarlyto more complex custom designed methods

In addition a full review of all recent literature in the machine learning applied tocausal inference has been done

ix

AcknowledgementsFirst and foremost I would like to thank my supervisor Dr Spyros Samothrakis Hehas been an essential pillar in the whole process of this dissertation and my careergiving me not only academic professional support but also encouraging me to makethe most out of this process

Second I would like to thanks my parents who had always been listening to myissues stress periods and emotional crisis and cheered me up every time I neededthem Without them I would not be able to be here writing this dissertation by anychance They made me believe that everything is possible in life if you try hardenough and you are a honest person Infinitely grateful to them forever

Third to my former university Professors from Universidad Tecnoloacutegica NacionalFactual Regional Buenos Aires Dr Oscar Bruno Dr Alejandro Prince and DraMariacutea Florencia Pollo-Cataneo for their recommendation letters support throughthe year and enlightenment in my professional career giving me always the bestadvice I could always get

Fourth I would like to thanks Dr Uri Shalit who offered me immediate help andadvise on this dissertationrsquos topic and who also helped me on the full IHDP datasetcollection metrics for benchmark comparisons

Fifth to all my classmates (some of them friends now) who spent with me count-less hours discussing about our passion making the world a better place throughMachine Learning and Artificial Intelligence

Last but not least I would like to thanks the Government of Argentina and the Ar-gentinian Ministry of Education for giving me the chance of coming to study to oneof the best universities in the world throughout the BECAR scholarship

xi

Contents

Declaration of Authorship iii

Abstract vii

Acknowledgements ix

1 Introduction 111 Motivation 112 Purpose and Research Question 213 Approach and Methodology 314 Scope and Limitation 4

2 Background 5201 Rubin-Newman Causal Model 5202 The fundamental problem of causal analysis 6203 Metrics for Causality 6204 Assumptions 7205 Definitions 7206 Related Work 7

21 Machine Learning 10211 Ordinary Least Squares (Linear Regression) 10212 Ridge Regression 11213 Support Vector Regressor 11214 Bayesian Ridge 11215 Lasso 12216 Lasso Lars 12217 ARD Regression 12218 Passive Aggressive Regressor 12219 Theil Sen Regressor 122110 K-Neighbors Regressor 132111 Logistic Regression 13

3 Methodology 1531 Dataset 1532 IHDP dataset 1633 Other articles metrics 17

4 Experiments 1941 Machine learning methods applied to IHDP dataset 2042 Other experiments 27

421 Recursive Feature Elimination 27422 Domain Adaptation Neural Networks 27

43 Discussion 27

xii

5 Conclusions 2951 Concluding Remarks 2952 Future work 30

Bibliography 31

xiii

List of Tables

41 IHDP 10 replications with traditional machine learning algorithms -Within sample 20

42 IHDP 10 replications with traditional machine learning algorithms -Out-of-sample 21

43 IHDP 100 replications - Within sample 2144 IHDP 100 replications - Out-of-sample 2145 IHDP 100 replications already split dataset - Within sample 2246 IHDP 100 replications already split dataset - Out-of-sample 2247 IHDP 100 replications - No scaling - Within sample 2348 IHDP 1000 replications - No Scaling - Out-of-sample 2349 IHDP 100 replications - Scaled - Within sample 23410 IHDP 1000 replications - No Scaling - Out-of-sample 24411 IHDP 100 replications logistic regressions - Within sample 24412 IHDP 100 replications logistic regressions - Out-of-sample 24413 IHDP 100 replications SVR Hyper-parameters tunning - Within sample 25414 IHDP 100 replications SVR Hyper-parameters tunning - Out-of-sample 25415 ICML 2017 - Estimating individual treatment effect generalization

bounds and algorithms (Shalit Johansson and Sontag 2017) 26416 ICML 2017 - Estimating individual treatment effect generalization

bounds and algorithms (Shalit Johansson and Sontag 2017) 26417 Domain Adaptation Neural Networks 27

xv

List of Abbreviations

ML Machine LearningSVR Support Vector RegressorRL Reinforcement LearningNN Neural NetworksLR Linear RegressionKNN K Nearest NeighboursRCE Randomized Controlled ExperimentITE Individual Treatment EffectATE Avarage Treatment EffectPEHE Precision in Estimation of Heterogenous EffectsCATE Conditional Average Treatment EffectRCM Rubin Casual Model

1

Chapter 1

Introduction

11 Motivation

Causality is often confused with correlation Correlation does not imply causationThese inferences are often called spurious correlations and they often confuses theinference process in which humans make decisions

A common definition of Causality is still not agreed by the scientific communitynowadays

The proven scientific way to make claims about cause and effect it is to performwhat is called a Randomized Controlled Trial(RCT) In a Randomized ControlledTrial a statistically representative portion of the population that will be participatingof the experiment (trial) are exposed to a treatment(action) which could be eitherpositive - apply the treatment - or neutral (control) -giving the patient a placebo ornot treating the patient(unit) at all

All these concepts are related to medical words since the field in which RCTs areapplied the most is medical trials However it is not the only industry in whichthese concept of dragging conclusions from a trial can be done For example it iswidely used in social studies but can also be applied to make decisions on buyingselling or holding a particular stock or displaying an advertisement that generatemore sales than the others in the advertisements industry

Nevertheless the Randomized Controlled Trials are the best way to detect causaleffects the possibility of perform such scientific rigorous experiments is most ofthe times either impossible or unethical An example could be seem when tryingto detect if driving while being under the effects of alcohol can affect (or not) thedriverrsquos skills Another clear example of this is determining the causes of smokingin teenagers or young people in which to perform a RCT would involve to take twogroups of non-smoking teenagers make half of the units smoke for several years - itdepends on the experiment or the research question - and then determine if smokingin young people would cause something or not after that period As the readercan infer there are clearly ethical problems associated with performing full RCTs todetermine causes and effects

It is also important to let clear that there are cases in which it would be impossi-ble to analyze previously collected data without having in mind the RCT methodThese cases are known as observational data Observational data is defined as infor-mation that can be obtained from previously collected situations in which a formal

2 Chapter 1 Introduction

randomized trial method has not been applied but it is still important to try to de-termine causes and effects from that data This is the case in most organizations inthe present since they would possibly be collecting massive amounts of data duringthe last decades but they could not or would not be able to establish a RCT processduring collection of the metrics Moreover sometimes the data collection processhappens in a non randomized controlled experiment For example a not so com-mon disease that just affects a small percentage of the population might happen toappear within a wide range of people that makes the inference process difficult

In the past causal Inference methods have been a Statisticians only field How-ever with recent advances of machine learning algorithms more computer scientistsand machine learning engineers have been trying to infer causes and relationshipsthrough traditional and new machine learning techniques

The ability to learn complex non-linear relationships of some machine learning algo-rithms have been trying to detect and predict policies in which given the particularfeatures of an individual (patient) the algorithm could determine if to apply or notthe treatment (action) to them This concept is also known as Individual Treatmenteffect estimation or it could also be referred as Policy risk when predicting a binaryclass which is is to apply or not the treatment through the discovery of a certainthreshold

It is a matter of interest to be able to predict the individual (customized) treatment effect, because this leads to better decisions (actions or treatments) specifically shaped for each person, instead of relying only on the average over the whole studied population.

The ultimate motivation of this work is to predict the Individual Treatment Effects for new patients from the previously collected data, using machine learning techniques that are alternatives to the ones used in past research efforts. Moreover, an attempt is made to compile a literature review that can be understood by more computer scientists, and running code for all the experiments performed is released for others to build upon.

1.2 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE) for a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program), a semi-synthetic, particularly unbalanced dataset created for the task of causal inference on observational data (Hill, 2011).

The research experiments investigate whether alternative machine learning methods, without adding extra complexity, custom error losses or custom metric functions while learning and predicting, are able to obtain similar or even better results than the state-of-the-art metrics based on the exact same benchmark dataset.


1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes Framework (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation will be used throughout this thesis. This model is also known as the Neyman-Rubin causal model, and it is an approach to the statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn (User guide: contents - scikit-learn 0.19.2 documentation) Python framework has been used. All its underlying methods and default hyper-parameters have been used. The mathematical notation of its documentation will also be presented to describe each algorithm's functions and limitations.

The different algorithms have been tested with a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill, 2011) as a semi-synthetic dataset based on real features obtained from an observational study (Gross, 1993). Replications of this dataset have been created to obtain 10, 100 and 1000 runs, in order to train the machine learning models, predict on them, and obtain the desired metric error results afterwards.

It is important to notice that the testing method and the metrics used to determine the effectiveness of each algorithm are different from the ones normally used to test machine learning algorithms; testing these algorithms within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset has been used, the real Individual Treatment Effect is available to compute testing metrics. Therefore, the experiment results will consist of the performance of each algorithm based on the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE). These three are the metrics displayed in the Experiments chapter for each algorithm trained. A detailed explanation of these formulas can be found in Section 2.0.3.

Also, it is important to notice that the machine learning algorithms are trained using just the treatment applied (observed), the features (covariates in the causal inference literature) and the observed outcome (usually known as Y or Y factual). After training, a dataset completely unseen during training is used for testing purposes.

The already trained algorithm predicts the outcome based on the unit's (also known as the patient's) features (covariates), both for the case in which the unit would have taken the treatment and, likewise, for the case in which the unit would have taken the control treatment. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE and PEHE metrics are calculated. In addition, an average score and its deviation over the 10, 100 and 1000 replications of the IHDP dataset have been computed to evaluate these errors in bigger simulated scenarios.
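The train-then-predict-both-arms procedure described above can be sketched as follows. The toy data and the choice of LinearRegression are illustrative stand-ins, not the actual IHDP pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy stand-in for one replication: covariates, applied treatment,
# and the observed (factual) outcome.
X = rng.normal(size=(100, 5))          # covariates
t = rng.integers(0, 2, size=100)       # applied treatment, t in {0, 1}
y_factual = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=100)

# Train on (covariates + treatment) -> factual outcome only.
model = LinearRegression().fit(np.column_stack([X, t]), y_factual)

# Predict both potential outcomes by toggling the treatment column.
y1_hat = model.predict(np.column_stack([X, np.ones_like(t)]))
y0_hat = model.predict(np.column_stack([X, np.zeros_like(t)]))

ite_hat = y1_hat - y0_hat              # estimated individual treatment effects
ate_hat = ite_hat.mean()               # estimated average treatment effect
```

Any regressor exposing `fit`/`predict` can be slotted in the same way, which is what makes the default scikit-learn estimators directly usable for this task.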

The mathematical notation will be kept to a minimum so as not to confuse the reader with unnecessary information.


1.4 Scope and Limitation

In this document, the outcomes of the treatment applied to a patient will be analyzed with respect to only two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments nor in the developed code, but both can easily be extended to cover those cases.

A binary treatment is applied, but its outcome value is continuous. This differs from the most commonly used case of four possible scenarios, in which usually just two can be observed or measured. All the experiments and the code developed can be applied to discrete outputs, but other machine learning techniques could be more suitable for this type of prediction (classification algorithms). Also, there are cases in which the task is to predict whether or not to apply the treatment to an individual (also known as a unit or patient). Predicting in these cases turns more into a classification task, in which a threshold on the confidence of affirmatively applying the treatment is set and validated, through trial and error against several continuous values, to determine which one predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

This case is more similar to real-world scenarios, where the data was observed and finally a decision on applying or not the treatment (action) has to be made in order to pursue a desired result.
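The trial-and-error thresholding idea can be sketched as follows. The numbers are hypothetical, and the "true" effect signs are only available here because the data is made up (as with a synthetic benchmark):

```python
import numpy as np

def policy(ite_hat, threshold):
    """Treat a unit exactly when its predicted individual effect exceeds the threshold."""
    return (ite_hat > threshold).astype(int)

# Hypothetical predicted and true individual effects for five units.
ite_hat     = np.array([0.3, -0.2, 0.8, 0.1, -0.5])
true_effect = np.array([0.4, -0.1, 0.6, -0.2, -0.4])

# Trial-and-error sweep: keep the threshold whose induced policy
# disagrees least with the sign of the true effect.
grid = np.linspace(-1.0, 1.0, 41)
errors = [np.mean(policy(ite_hat, th) != (true_effect > 0)) for th in grid]
best_threshold = grid[int(np.argmin(errors))]
```

In a real evaluation the disagreement rate would be replaced by the policy risk estimated from factual outcomes, but the sweep over candidate thresholds is the same.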

In this work, datasets containing outcomes in a binary form, used to predict whether or not to apply the treatment, will not be covered.


Chapter 2

Background

2.0.1 Neyman-Rubin Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Neyman-Rubin Potential Outcomes framework, is an extended statistical analysis frame for modeling observational data, developed by Donald Rubin. He built the mentioned framework on top of the original method that Jerzy Neyman developed in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Neyman-Rubin potential outcomes framework considers units

xi ∈ X

with an effectively applied treatment

ti ∈ {0, 1}

The two possible potential outcomes are defined by

Y0(xi), Y1(xi) ∈ Y

For one of them (the one which actually happened) we can observe its factual outcome

yiF = ti Y1(xi) + (1 − ti) Y0(xi)

Let (x1, t1, y1F), …, (xn, tn, ynF) be a sample from the factual distribution.

Consequently, let (x1, 1 − t1, y1CF), …, (xn, 1 − tn, ynCF) be the counterfactual sample.

Notice that all the factual outcomes yF are known, whereas the counterfactual outcomes yCF are never known for any unit (except in the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions yF and yft will be used as interchangeable terms referring to factual (observed) outcomes, while yCF and ycft will denote counterfactual outcomes.
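The selection of the factual outcome from the two potential outcomes, yiF = ti Y1(xi) + (1 − ti) Y0(xi), can be written directly. The numbers are illustrative, and both arms are only known here because the data is made up:

```python
import numpy as np

# Illustrative potential outcomes for five units (both arms known only
# because this is synthetic; in observational data one of them is missing).
Y0 = np.array([1.0, 2.0, 0.5, 3.0, 1.5])   # outcome under control
Y1 = np.array([2.0, 2.5, 1.0, 2.0, 3.0])   # outcome under treatment
t  = np.array([1, 0, 0, 1, 1])             # treatment actually applied

y_factual        = t * Y1 + (1 - t) * Y0   # observed: yF
y_counterfactual = (1 - t) * Y1 + t * Y0   # never observed: yCF
```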


2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x and assigning either treatment t = 1 or t = 0 to that unit, it is impossible to observe the counterfactual outcome E[Y0 | x, t = 1] or E[Y1 | x, t = 0] (what would have happened, or what the outcome would have been, if the other treatment had been given to the unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y0 | x, t = 0] or E[Y1 | x, t = 1], or in shorter terms Y0 or Y1.

In this dissertation, the focus is on the case where the causal graph is simple and known to be of the form (Y1, Y0) ← x → t, with no hidden confounders.

This problem can be extended, as can most of the problems and applications discussed under the Neyman-Rubin Potential Outcomes Framework in this dissertation, to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome yCF is even worse in multi-treatment experiments, since the number of missing values that matter for better Individual Treatment Effect estimates grows with the total number of possible treatments, minus the one actually applied.

2.0.3 Metrics for Causality

Three well-known metrics in the causality field are reported for each machine learning technique applied.

The losses that will be reported are

• εITE: Error of the Individual Treatment Effect - also known as the Conditional Average Treatment Effect (CATE) - which measures how well or badly the treatment performs on one particular patient.

ITE(x) := E[y | X = x, t = 1] − E[y | X = x, t = 0] = E[Yx1 − Yx0]

• εATE: Error of the Average Treatment Effect. As its name describes, it represents the effect that the applied treatment, either t = 0 or t = 1, effectively had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, since the patient's unique characteristics as a unit might make them experience wrong results or no results at all.

ATE := E[ITE(x)] = E[δ] = E[Y1 − Y0], ∀x ∈ X

• εPEHE: the Precision in Estimation of Heterogeneous Effect measures the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or not as accurate, for the other.

PEHE := (1/N) Σ_{i=1}^{N} ((ŷi1 − ŷi0) − (yi1 − yi0))²

where ŷi1 and ŷi0 are the predicted outcomes and yi1 and yi0 the true ones.
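Given predicted and true potential outcomes (the true side being available only because the benchmark is semi-synthetic), the three errors reduce to a few lines. This is a sketch: the PEHE function follows the formula as written (some papers report its square root), and the mean absolute deviation used for the ITE error is one common choice; the exact forms in the comparison code may differ.

```python
import numpy as np

def eps_ate(y1_hat, y0_hat, y1, y0):
    """Absolute error on the Average Treatment Effect."""
    return abs(np.mean(y1_hat - y0_hat) - np.mean(y1 - y0))

def eps_pehe(y1_hat, y0_hat, y1, y0):
    """Precision in Estimation of Heterogeneous Effect (mean squared ITE error)."""
    return np.mean(((y1_hat - y0_hat) - (y1 - y0)) ** 2)

def eps_ite(y1_hat, y0_hat, y1, y0):
    """Mean absolute error on the Individual Treatment Effects (one common choice)."""
    return np.mean(np.abs((y1_hat - y0_hat) - (y1 - y0)))
```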


2.0.4 Assumptions

To work on the results, three important assumptions under the Neyman-Rubin causal framework shall be made:

• Consistency: For each unit, just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y0 will be the observed, or factual, outcome (yF). However, if the applied treatment was t = 1, then y = Y1 will be the available observed, or factual, outcome yF.

• Strong Ignorability: Also known as no unmeasured confounders, this assumption can be stated as (Y1, Y0) ⫫ t | x and 0 < p(t = 1 | x) < 1, ∀x. It is important to notice that, to be able to state this assumption, a domain knowledge expert has to assess the dataset and thereby determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: This assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 | x) < 1
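Common support can be probed empirically by estimating propensity scores. This sketch uses a logistic-regression propensity model on made-up data; the model choice and the 0.01/0.99 cut-offs are illustrative assumptions, not part of the IHDP protocol:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
# Treatment assignment depends on the covariates (confounded, but overlapping).
p_true = 1.0 / (1.0 + np.exp(-X[:, 0]))
t = rng.binomial(1, p_true)

# Estimate propensity scores e(x) = P(t = 1 | x).
propensity = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]

# Overlap holds empirically when the scores stay strictly inside (0, 1);
# scores piling up near 0 or 1 flag regions with little or no common support.
ok = (propensity > 0.01) & (propensity < 0.99)
```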

2.0.5 Definitions

In causal inference from observational data, several terms are used interchangeably and might confuse the reader.

These terms should be clear before going further into this dissertation.

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input. Notation: xi, xi ∈ X.

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML). Notation: x, x ∈ X.

• treatment: the possible different actions that can be applied to a unit. Usually binary, but it can be multi-valued under the Neyman-Rubin Potential Outcomes Framework. Synonym: action. Notation: t, t ∈ {0, 1} or t ∈ {0, …, N}.

• outcome: the measured result of applying a treatment t to a unit x. Synonyms: observed outcome, result, factual, Y factual. Notation: yf = yF.

• counterfactual: what the result would have been if the treatment opposite to the one effectively applied had been given to a unit. Synonym: unobserved outcome. Notation: ycft, ycf, YCF.

2.0.6 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin, 1978).


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning the attention was caught not less than a decade ago (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference for counterfactual prediction is usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods model the relationships between features, action pairs and rewards by means of one or more parameters, trying to specifically model the relations between context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015) and regression trees (Chipman, George and McCulloch, 2010) have been used in the past to complete the task. For example, (Wager and Athey, 2017) estimate ITEs via causal forests, but their asymptotic estimates on datasets with a large number of relevant features have limitations that need to be addressed in future work.

In non-parametric approaches, the counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causality is performed by merging parametric and non-parametric methods (Dudik, Langford and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both families. A common example of this approach is propensity-score-weighted regression (Bang and Robins, 2005; Dudik, Langford and Li, 2011). When the treatment assignment probability is known, this method models the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational data settings, their efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years through the development of custom metric functions - along with the application of other techniques to causality - with special focus on datasets with unbalanced treatment application. This refers to the sub-area known as causal inference from observational data; observational data is data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting advances to the scientific community (Shalit, Johansson and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag, 2016a; Shalit, Johansson and Sontag, 2017; Alaa, Weisz and Van Der Schaar, 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization, the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few, (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016) and (Shalit, Johansson and Sontag, 2017). (Johansson, Shalit and Sontag, 2016a) and (Shalit, Johansson and Sontag, 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error loss between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag, 2017) the authors built on (Johansson, Shalit and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work has implemented Gaussian processes (Alaa, Weisz and Van Der Schaar, 2017) and decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with several features by performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning features that are relevant for predicting some actions while not taking them into account for others. The relevant-feature-selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), likewise in (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and did not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims, 2015a) came up with a Counterfactual Risk Minimization (CRM) method in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald and Pereira, 2006). Additional techniques in policy optimization were developed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work in the DA techniques field was done by (Zhang et al., 2013; Daumé, 2009).


To conclude, in cause and effect analysis, time series data is widely adopted for decision-making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time series causality task. Later in time, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that is set beforehand (Dudik, Langford and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas and Brunskill, 2017).

2.1 Machine Learning

This section describes the machine learning techniques, implemented with the scikit-learn open-source framework, that were applied to perform the experiments.

The vast majority of the methods tested belong to the Generalized Linear Models family, in which the target or label value is represented as a linear combination of the covariates (inputs):

ŷ(w, x) = w0 + w1x1 + … + wpxp (2.1)

where the vector w = (w1, …, wp) represents the coefficients and w0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model, the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it.

Mathematically, it solves the problem of:

min_w ||Xw − y||₂²

The main limitation of this method is that, if the features (covariates) have an approximate linear dependence, the model produces a high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.
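A minimal usage sketch on toy data (not the thesis pipeline): on exactly linear data, scikit-learn's LinearRegression recovers the coefficients and the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Exact linear data: y = 1 + 2*x1 + 3*x2, so OLS recovers the parameters.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

reg = LinearRegression().fit(X, y)
# reg.coef_ is approximately [2, 3]; reg.intercept_ is approximately 1.
```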


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. The loss turns into a problem of minimizing a penalized sum of squares:

min_w ||Xw − y||₂² + α||w||₂²

It is worth mentioning that the parameter α ≥ 0 controls the amount of shrinkage, and hence how robust to collinearity the trained model is going to be.
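The effect of α can be seen by fitting the same toy data with a light and a heavy penalty; the α values here are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

light = Ridge(alpha=0.01).fit(X, y)   # barely penalized, close to OLS
heavy = Ridge(alpha=100.0).fit(X, y)  # strong penalty shrinks the coefficients

# A larger alpha trades variance for bias: the coefficient norm shrinks.
```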

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely spread Support Vector Machines for classification, adapted to solve regression problems. During the training phase, the best possible solution is the one that is penalized least in total by a loss function. The support vectors are the inputs that are either mispredicted, predicted within too small a margin, or located on the edge of the generated hyper-plane used for future predictions.

In particular, an SVR takes the training vectors xi ∈ Rp, i = 1, …, n, and a vector y ∈ Rn; ε-SVR solves the following primal problem:

min_{w,b,ζ,ζ*} (1/2) wᵀw + C Σ_{i=1}^{n} (ζi + ζi*)

subject to yi − wᵀφ(xi) − b ≤ ε + ζi,
wᵀφ(xi) + b − yi ≤ ε + ζi*,
ζi, ζi* ≥ 0, i = 1, …, n

where C > 0 is the regularization upper bound and Q is an n × n positive semidefinite matrix with Qij ≡ K(xi, xj) = φ(xi)ᵀφ(xj), K being the kernel. Here the training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is:

Σ_{i=1}^{n} (αi − αi*) K(xi, x) + ρ
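In scikit-learn the formulation above corresponds to sklearn.svm.SVR; a toy fit on noisy sine data follows, with illustrative values for C and epsilon.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.05, size=200)

# RBF-kernel epsilon-SVR: residuals smaller than epsilon are not penalized.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)

# The support vectors are the training points on or outside the epsilon-tube.
n_support = len(svr.support_)
pred = svr.predict(np.array([[0.0]]))
```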

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than plain Linear Regression.

This technique builds a probabilistic model of the regression problem, with the parameter w of the general Bayesian regression solver given a spherical Gaussian prior:

p(w | λ) = N(w | 0, λ⁻¹ Ip)

The scikit-learn defaults are used to train the model: α1 = α2 = λ1 = λ2 = 10⁻⁶.

During the model fitting process, the parameters w, α and λ are estimated jointly.


2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

min_w (1 / (2 n_samples)) ||Xw − y||₂² + α||w||₁

This method minimizes the least-squares loss with α||w||₁ added, where α is a constant and ||w||₁ is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm retrieves sparse models, which may be helpful to perform feature selection.
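The sparsity can be seen on toy data where only two of ten features actually matter; the α value is illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features matter.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
# The l1 penalty drives the eight irrelevant coefficients exactly to zero,
# so the nonzero pattern performs feature selection for free.
n_nonzero = np.count_nonzero(lasso.coef_)
```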

2.1.6 Lasso Lars

This model is trained with Least Angle Regression (LARS), with L1 regularization applied.

The objective function is determined by

(1 / (2 n_samples)) ||y − Xw||₂² + α||w||₁

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w.

It drops the assumption of the Gaussian being spherical, allowing it to be elliptical.

Mathematically:

p(w | λ) = N(w | 0, A⁻¹)

with diag(A) = λ = {λ1, …, λp}.

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, passive-aggressive algorithms do not require a learning rate, but they do require a regularization parameter C.

It can be used with two different loss functions: PA-I, or epsilon-insensitive, and PA-II, also known as squared epsilon-insensitive.
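In scikit-learn the two losses are selected by name; the data and the C value below are illustrative.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5])   # noiseless linear target

# PA-I and PA-II correspond to the two loss choices.
pa1 = PassiveAggressiveRegressor(loss="epsilon_insensitive",
                                 C=1.0, max_iter=1000, random_state=0).fit(X, y)
pa2 = PassiveAggressiveRegressor(loss="squared_epsilon_insensitive",
                                 C=1.0, max_iter=1000, random_state=0).fit(X, y)
```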

2.1.9 Theil Sen Regressor

It is especially suited to datasets with multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems; in high dimension, this method becomes similar to a Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm, the target is predicted from the n nearest neighbors found in the training set. It is important to notice that n is defined by the user, and it will affect, positively or negatively, the obtained prediction results.
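A tiny illustration of how the choice of n changes the prediction (made-up points):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])   # y = x**2

# The prediction is the mean target of the n nearest training points.
k1 = KNeighborsRegressor(n_neighbors=1).fit(X, y).predict([[2.6]])
k3 = KNeighborsRegressor(n_neighbors=3).fit(X, y).predict([[2.6]])
# k1 uses only the nearest point x = 3, so it returns 9.0;
# k3 averages the targets at x = 2, 3, 4, so it returns 29/3.
```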

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems; it models class probabilities through the logistic function and can handle predictions over more than two classes.

This scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets, and the results will be discussed in the Experiments section.


Chapter 3

Methodology

This chapter details what was set out to be achieved. In addition, the methods that were used are explained, as well as any other information needed to help the reader follow the flow of the experiments covered later.

It also presents how the dataset was used, closing with a section about other possible datasets that could be applied and possible limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project - and, to some extent, of the latest efforts in machine learning applied to causality - is to make accurate predictions on a set of units, patients or inputs (in the machine learning vocabulary) that was collected without the chance of previously designing a carefully planned Randomized Controlled Trial. By its nature, such already collected, or observational, data has neither been properly randomized nor drawn from the same probability distribution. Also, the number of units which received the treatment versus the number which did not could potentially differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions, since they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or the pedestrians, tested against a control treatment - which in this case would be driving without alcohol consumption - is completely unethical to perform, for clear reasons.

To overcome these limitations when working on causal effects from observational data, synthetic, semi-synthetic or toy datasets are created by researchers in order to establish a good starting point and a benchmark framework to try, test or develop better algorithms that are able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions for causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases, a policy risk function π is designed to apply or not the treatment t depending on a certain threshold θ. The goal is to make as few errors as possible when predicting the application of the active treatment or control, while iterating over different values of the threshold on the trained dataset.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment conditions to reduce the developmental and health problems caused by the low birth weight of premature infants. On the one hand, the treated group received home visits and integration at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

However, (Hill, 2011) presented a semi-synthetic (also referred to in this work and in the field as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the paragraph above. In (Hill, 2011), some continuous and binary covariates from this real-life RCT were selected. Making use of these covariates, (Hill, 2011) created a simulated outcome, generating non-parametric simulated outcomes for the whole population of the trial; 25 covariates of the whole study were taken for this dataset creation. Subsequently, the author introduced an artificial imbalance between the control and treatment individuals by removing a subset of the treated population. Finally, the dataset comprises 747 subjects (units or inputs), of which 608 had not been given the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting effects of the treatment t = 0 or t = 1 based on the generalization that an algorithm can perform.

Along with the covariates, for each unit the simulated causal information can be observed: the effectively applied treatment (t = 0 or t = 1), the observed outcome (yF), the counterfactual outcome (yCF), and the average outcomes with noise, mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017), which constitute the state-of-the-art baseline chosen for comparison in the experiments of the present work.
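For readers who want to reproduce the slicing of one replication, the sketch below builds a tiny in-memory stand-in for a replication file. The key names (x, t, yf, ycf, mu0, mu1) and the (units, covariates, replications) layout follow the .npz files commonly distributed with the CFRNet/CEVAE codebases, but they are an assumption here, not a specification of the actual download:

```python
import io
import numpy as np

# Tiny stand-in for an IHDP replication archive (assumed keys and layout).
units, covs, reps = 747, 25, 10
rng = np.random.default_rng(0)
buf = io.BytesIO()
np.savez(buf,
         x=rng.normal(size=(units, covs, reps)),     # covariates
         t=rng.integers(0, 2, size=(units, reps)),   # applied treatment
         yf=rng.normal(size=(units, reps)),          # factual outcome
         ycf=rng.normal(size=(units, reps)),         # counterfactual outcome
         mu0=rng.normal(size=(units, reps)),         # average control outcome
         mu1=rng.normal(size=(units, reps)))         # average treated outcome
buf.seek(0)

data = np.load(buf)
r = 0  # slice out a single replication to get one train-ready dataset
X, t, yf = data["x"][:, :, r], data["t"][:, r], data["yf"][:, r]
```

With the real files, `np.load` would be pointed at the downloaded archive instead of the in-memory buffer.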

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of new machine learning techniques applied to causal inference on observational data.


3.3 Other articles' metrics

Other published articles evaluating ITE, ATE and PEHE errors are worth mentioning for the reader who wants to look further into them. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit and Sontag, 2016b), the IHDP dataset (Hill, 2011) was run on 100 replication experiments to perform hyperparameter tuning and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation, since the response surface chosen differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson and Sontag, 2017). To obtain the BART results in (Johansson, Shalit and Sontag, 2016b), the authors relied on Bayesian Additive Regression Trees (Chipman, George and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill, 2011). However, the authors neither make explicit the number of replications used to gather the metrics, nor state whether the log-linear "A" or "B" surface, or any other method, was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with the results of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset were performed to ultimately predict the factual (yF) and counterfactual (yCF) outcomes for every single unit. Subsequently, those values are input into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE, and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as should be clear to the reader by this point, neither the counterfactual outcome yCF nor the simulated mean outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these three values, along with the factual outcome yF, are only used to compute the εITE, εATE, and εPEHE errors.
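The standard definitions of these errors can be sketched as follows. This is a sketch of the commonly used formulas (true effect taken from mu1 - mu0, predictions from a model queried at t = 0 and t = 1); the exact εITE computation in the code of (Louizos et al., 2017) may differ in detail and is not reproduced here:

```python
import numpy as np

def causal_errors(mu0, mu1, y0_hat, y1_hat):
    """Common error definitions (sketch): true per-unit effect tau = mu1 - mu0;
    tau_hat comes from querying a fitted model with t forced to 1 and to 0."""
    tau_true = mu1 - mu0
    tau_hat = y1_hat - y0_hat
    eps_ate = np.abs(tau_true.mean() - tau_hat.mean())          # |ATE error|
    eps_pehe = np.sqrt(np.mean((tau_true - tau_hat) ** 2))       # sqrt(PEHE)
    return eps_ate, eps_pehe

# Tiny worked example: a perfect predictor gives zero error on both metrics.
mu0 = np.array([1.0, 2.0])
mu1 = np.array([3.0, 5.0])
eps_ate, eps_pehe = causal_errors(mu0, mu1, y0_hat=mu0, y1_hat=mu1)
print(eps_ate, eps_pehe)  # 0.0 0.0
```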

The experiments were run on 10, 100, and 1000 replications, both within-sample and out-of-sample. The 10, 100, and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with their state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson, and Sontag, 2017) in their publication, the same convention later followed by (Louizos et al., 2017) to perform, compare, and show their results.

Within-sample: this refers to all the errors (ITE, ATE, and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples to which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and its factual outcome are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All these are common problems of observational data and were mentioned in the previous chapters.


Out-of-sample: these predictions are made on completely unseen units, held out of the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even further from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE, and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0; subsequently, predictions are made setting all the values of the treatment to t = 1. The difference between these two predictions for each input is known as the ITE, and it ultimately defines whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by E[Y1 − Y0 | x].

The machine learning algorithms, implemented in Python with (User guide: contents — scikit-learn 0.19.2 documentation), were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. The hyperparameter tuning was done on 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
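The toggle-the-treatment procedure described above can be sketched with any scikit-learn regressor. The data below is synthetic (with a known simulated effect of 2.0), not IHDP, and LinearRegression stands in for the full set of regressors used in the experiments:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 5))
t = rng.integers(0, 2, size=n).astype(float)
y_factual = x[:, 0] + 2.0 * t + 0.1 * rng.normal(size=n)  # simulated effect = 2.0

# Train on the factual data only, with the treatment appended as one extra feature.
model = LinearRegression().fit(np.column_stack([x, t]), y_factual)

# Query the fitted model twice per unit, toggling the treatment indicator.
y0_hat = model.predict(np.column_stack([x, np.zeros(n)]))
y1_hat = model.predict(np.column_stack([x, np.ones(n)]))
ite_hat = y1_hat - y0_hat  # per-unit estimate of E[Y1 - Y0 | x]
print(round(float(ite_hat.mean()), 1))  # → 2.0
```

The same pair of predictions per unit is what feeds the εITE, εATE, and εPEHE calculations.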

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14   0.94 ± 0.35   2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99   0.97 ± 0.67   4.80 ± 2.80
LassoLars                         4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
Lasso                             4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
ARDRegression                     3.92 ± 2.01   0.97 ± 0.74   4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09   1.54 ± 1.07   4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99   0.89 ± 0.63   4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
LinearRegression                  3.92 ± 2.01   0.89 ± 0.65   4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is worth noting that the split between training and testing was performed randomly, over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63   1.24 ± 0.62   2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67   1.82 ± 1.35   4.13 ± 2.23
LassoLars                         4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
Lasso                             4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
ARDRegression                     3.57 ± 1.70   1.83 ± 1.41   4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94   2.38 ± 1.75   4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68   1.76 ± 1.30   4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
LinearRegression                  3.58 ± 1.70   1.77 ± 1.33   4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60   0.86 ± 0.16   5.65 ± 0.83
LassoLars                         4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
Lasso                             4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
ARDRegression                     4.49 ± 0.60   0.81 ± 0.16   5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75   0.83 ± 0.14   5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59   0.79 ± 0.15   5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
LinearRegression                  4.53 ± 0.60   0.79 ± 0.16   5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54   1.02 ± 0.26   5.37 ± 0.78
LassoLars                         4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
Lasso                             4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
ARDRegression                     4.27 ± 0.54   1.00 ± 0.26   5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69   1.00 ± 0.21   5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53   0.99 ± 0.25   5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
LinearRegression                  4.31 ± 0.55   0.99 ± 0.26   5.36 ± 0.79


Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which accounts for the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, the hyperparameter tuning, when any, was performed on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38   0.76 ± 0.08   3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58   0.80 ± 0.14   5.61 ± 0.83
LassoLars                         4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
Lasso                             4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
ARDRegression                     4.45 ± 0.59   0.77 ± 0.15   5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62   0.83 ± 0.13   5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57   0.72 ± 0.13   5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
LinearRegression                  4.48 ± 0.59   0.75 ± 0.14   5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28   0.73 ± 0.07   3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57   0.81 ± 0.11   5.74 ± 0.89
LassoLars                         4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
Lasso                             4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
ARDRegression                     4.42 ± 0.58   0.78 ± 0.11   5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59   1.01 ± 0.17   5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56   0.85 ± 0.13   5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
LinearRegression                  4.45 ± 0.58   0.78 ± 0.11   5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting "B" generated using the code from (Dorie, 2016), was used to perform both types of measurement.

Four different tables are presented for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaled versions as the final presented results of the methods in the following section.
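The scaling step can be sketched with a scikit-learn pipeline. The data here is synthetic, and fitting the scaler inside a pipeline (so its statistics come from the training covariates only) is an assumption about the setup, shown as the idiomatic way to combine MinMaxScaler with a regressor:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x_train, y_train = rng.normal(size=(100, 6)), rng.normal(size=100)
x_test = rng.normal(size=(20, 6))

# MinMaxScaler maps each feature to [0, 1]; inside the pipeline it is fit
# on the training covariates only, so no test-set statistics leak in.
model = make_pipeline(MinMaxScaler(), SVR())
model.fit(x_train, y_train)
preds = model.predict(x_test)
print(preds.shape)  # (20,)
```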



TABLE 4.7: IHDP 1000 replications - No scaling - Within sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12   0.69 ± 0.02   3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19   0.78 ± 0.04   5.81 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22   0.90 ± 0.05   5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18   0.70 ± 0.03   5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09   0.78 ± 0.03   3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19   0.98 ± 0.05   5.79 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22   1.13 ± 0.07   5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08   0.33 ± 0.02   2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19   0.73 ± 0.04   5.80 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22   1.02 ± 0.06   5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.69 ± 0.03   5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20   1.64 ± 0.09   6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18   0.67 ± 0.03   5.79 ± 0.25
ElasticNet                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is well below that of the regressors, the main reason being that the encoded target classes, to which probabilities are assigned, are not the same values that need to be predicted; additionally, precision is lost when decoding the predictions. The l2 norm was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
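The thesis does not detail the encoding; one plausible sketch (an assumption, not the thesis's actual code) is binning the continuous outcome into discrete classes, fitting a multinomial logistic regression, and decoding predictions back to bin centres, which is exactly where the precision loss mentioned above occurs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
x = rng.normal(size=(300, 4))
y = x[:, 0] + rng.normal(scale=0.1, size=300)   # continuous outcome

# Encode the continuous target as bin indices (10 classes);
# decoding back to bin centres is where precision is lost.
bins = np.linspace(y.min(), y.max(), 11)
labels = np.digitize(y, bins[1:-1])             # classes 0..9
clf = LogisticRegression(solver="newton-cg", max_iter=500).fit(x, labels)

centres = (bins[:-1] + bins[1:]) / 2
y_hat = centres[clf.predict(x)]                 # decoded continuous predictions
print(y_hat.shape)  # (300,)
```

With the newton-cg (or lbfgs) solver, scikit-learn fits the multinomial objective over the classes, matching the multi-class setting reported in the tables.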

TABLE 4.11: IHDP 100 replications, logistic regressions - Within sample

Method                                 εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)    7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)        7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                                 εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)    5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)        5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were performed. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were: Radial Basis Function (rbf) kernel, C = 1e3, and gamma = 0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyperparameter selection, with the final selected configuration among them.
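The gamma sweep behind Tables 4.13 and 4.14 can be reproduced with scikit-learn's GridSearchCV; the grid values mirror the tables, but the synthetic data and the cross-validated mean-squared-error scoring are assumptions, shown only as a sketch of the tuning procedure:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(3)
x = rng.normal(size=(120, 5))
y = np.sin(x[:, 0]) + 0.1 * rng.normal(size=120)

# Grid mirroring the sweep in Tables 4.13/4.14: RBF kernel, C = 1e3,
# gamma from 0.1 down to 1e-5 (the thesis selected C=1e3, gamma=0.01).
grid = GridSearchCV(
    SVR(kernel="rbf", C=1e3),
    param_grid={"gamma": [0.1, 0.05, 0.01, 0.001, 1e-4, 1e-5]},
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(x, y)
print(grid.best_params_)  # the gamma with the best cross-validated score
```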

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within sample

Configuration               εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1            3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g0.05           2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g0.01           2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0.001          3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g0.0001         4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g0.00001        4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g0.1           3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g0.1           3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g0.1           3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2        2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1        2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4        2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2       2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

Configuration               εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1            2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g0.05           2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g0.01           2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0.001          3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g0.0001         4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g0.00001        4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g0.1           2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g0.1           2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g0.1           2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2        2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1        2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4        2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2       3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39

Finally, the results obtained by this thesis and the experiments run are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method      √εPEHE       εATE
OLS/LR-1    5.8 ± 0.3    0.73 ± 0.04
OLS/LR-2    2.4 ± 0.1    0.14 ± 0.01
BLR         5.8 ± 0.3    0.72 ± 0.04
k-NN        2.1 ± 0.1    0.14 ± 0.01
TMLE        5.0 ± 0.2    0.30 ± 0.01
BART        2.1 ± 0.1    0.23 ± 0.01
RANDFOR     4.2 ± 0.2    0.73 ± 0.05
CAUSFOR     3.8 ± 0.2    0.18 ± 0.01
BNN         2.2 ± 0.1    0.37 ± 0.03
TARNET      0.88 ± 0.0   0.26 ± 0.01
CFR MMD     0.73 ± 0.0   0.30 ± 0.01
CFR WASS    0.71 ± 0.0   0.25 ± 0.01

Within sample IHDP 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method      √εPEHE       εATE
OLS/LR-1    5.8 ± 0.3    0.94 ± 0.06
OLS/LR-2    2.5 ± 0.1    0.31 ± 0.02
BLR         5.8 ± 0.3    0.93 ± 0.05
k-NN        4.1 ± 0.2    0.79 ± 0.05
BART        2.3 ± 0.1    0.34 ± 0.02
RANDFOR     6.6 ± 0.3    0.96 ± 0.06
CAUSFOR     3.8 ± 0.2    0.40 ± 0.03
BNN         2.1 ± 0.1    0.42 ± 0.03
TARNET      0.95 ± 0.0   0.28 ± 0.01
CFR MMD     0.78 ± 0.0   0.31 ± 0.01
CFR WASS    0.76 ± 0.0   0.27 ± 0.01

Out-of-sample IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though powerful feature selection methods were already presented in the Related Work section, an additional experiment was performed in the developed code, running Recursive Feature Elimination (RFE) using the scikit-learn library.

In addition, assuming Strong Ignorability on the studied dataset, it would not strictly be appropriate to perform such an experiment; however, given the nature of the machine learning regressors and their sensitivity to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse across all algorithms for the causal inference metrics detailed in this work; thus they are not shown, but they can be revised by the reader in the code implementation for further analysis.
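The RFE experiment can be sketched as follows; the synthetic data and the choice of a linear estimator are assumptions (scikit-learn's RFE requires an estimator exposing `coef_` or `feature_importances_`, which rules out an RBF-kernel SVR):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x = rng.normal(size=(200, 10))
# Only features 0 and 5 actually drive the outcome.
y = 3.0 * x[:, 0] - 2.0 * x[:, 5] + 0.1 * rng.normal(size=200)

# Recursively drop the weakest feature (smallest |coef_|) until 2 remain.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(x, y)
print(np.flatnonzero(selector.support_))  # expect features 0 and 5
```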

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the 10-replication results reported in this work; the uploaded code contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine. On CPU, the estimated completion time was about four and a half days with an Intel Dual Core i7.

Domain Adaptation algorithms are a promising direction for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                  εITE          εATE          √εPEHE
DANN (Within-sample)    1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and Out-of-sample IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the machine learning regressor algorithms applied in this dissertation perform very close to the results published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treatment assignment, nor any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.


The considerable effort spent by those authors appears to lead to complicated methods without gaining much in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the 10-replication Domain Adaptation Neural Network training and testing errors showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate, custom techniques that constitute the state-of-the-art performance results.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except scaling the features, which is part of the must-do tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes to later present the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, though it evolved once the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions: Domain Adaptation Neural Networks, as well as other methods from the Deep Learning literature. Moreover, there are continuous-space causal inference problems on observational data, involving more than two possible outcomes, that are substantially better suited to Reinforcement Learning algorithms than to any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment set-up would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold determining whether a treatment should be applied or not, would be an important next step in causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the near state-of-the-art precision it achieved in the experiment run here.

Lastly, these methods could be applied to causal datasets whose outcomes vary with time and with the treatments applied, framed as time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers applying machine learning to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". In: DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803

Johansson, Fredrik D. (2017 (accessed July 19, 2018)). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period — application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson and David Sontag (2017). Supplemental Materials for "Estimating individual treatment effect: generalization bounds and algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (Complete Draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv:1510.04342. URL: http://arxiv.org/abs/1510.04342

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.



ldquoThirty years ago we used to ask Can a computer simulate all processes of logic The answerwas yes but the question was surely wrong We should have asked Can logic simulate allsequences of cause and effect And the answer would have been no rdquo

Gregory Bateson Mind and Nature


UNIVERSITY OF ESSEX

Abstract


School of Computer Science and Electronic Engineering

Master of Science in Artificial Intelligence

Machine Learning for causal Inference on Observational Data

by Hernán E. BORRÉ

The established scientific way to make claims about cause and effect is to perform a Randomized Controlled Trial (RCT). However, although RCTs are the best way to determine causal effects, performing such rigorous scientific experiments is most often either impossible or unethical. The Average Treatment Effect (ATE) is usually the outcome of RCT experiments, and this outcome is ideally proof of an effect in the studied population which hopefully extends to other individuals. In contrast, it is most common to find Observational Data, in which the collected data might be heavily unbalanced with respect to treatment assignments, or the patients' covariates might come from completely different distributions. Nevertheless, the ultimate goal of causal effect estimation is to find the specific Individual Treatment Effect (ITE) for each patient. Identifying the Individual Treatment Effect is a topic that has always been important in the field of causality, especially within the machine learning community.

Applications of such predictions are related to medicine, but they can be extensively used in financial investments, advertisement placement, recommender systems for retail, the social sciences and beyond.

The ability of some machine learning algorithms to learn complex non-linear relationships has been used to try to detect and predict policies in which, given the particular features of an individual (patient), the algorithms could determine whether or not to apply the treatment to them.

In this thesis the ITE is predicted using a semi-synthetic benchmark dataset which has been unbalanced. Assuming strong ignorability, alternative machine learning techniques that had not been tested in past publications are applied to predict the ITE from observational data. The results obtained are compared with state-of-the-art outcomes; some of the algorithms applied in this work performed similarly to more complex, custom-designed methods.

In addition, a full review of the recent literature on machine learning applied to causal inference has been carried out.


Acknowledgements

First and foremost, I would like to thank my supervisor, Dr Spyros Samothrakis. He has been an essential pillar in the whole process of this dissertation and my career, giving me not only academic and professional support but also encouraging me to make the most out of this process.

Second, I would like to thank my parents, who have always listened to my issues, stress periods and emotional crises, and cheered me up every time I needed them. Without them I would not be here writing this dissertation by any chance. They made me believe that everything is possible in life if you try hard enough and you are an honest person. Infinitely grateful to them, forever.

Third, to my former university Professors from Universidad Tecnológica Nacional, Facultad Regional Buenos Aires, Dr Oscar Bruno, Dr Alejandro Prince and Dra María Florencia Pollo-Cataneo, for their recommendation letters, support throughout the year and enlightenment in my professional career, always giving me the best advice I could get.

Fourth, I would like to thank Dr Uri Shalit, who offered me immediate help and advice on this dissertation's topic and who also helped me with the full IHDP dataset collection and the metrics for benchmark comparisons.

Fifth, to all my classmates (some of them friends now) who spent countless hours with me discussing our passion: making the world a better place through Machine Learning and Artificial Intelligence.

Last but not least, I would like to thank the Government of Argentina and the Argentinian Ministry of Education for giving me the chance to come to study at one of the best universities in the world through the BECAR scholarship.


Contents

Declaration of Authorship iii

Abstract vii

Acknowledgements ix

1 Introduction 1
1.1 Motivation 1
1.2 Purpose and Research Question 2
1.3 Approach and Methodology 3
1.4 Scope and Limitation 4

2 Background 5
2.0.1 Neyman-Rubin Causal Model 5
2.0.2 The fundamental problem of causal analysis 6
2.0.3 Metrics for Causality 6
2.0.4 Assumptions 7
2.0.5 Definitions 7
2.0.6 Related Work 7

2.1 Machine Learning 10
2.1.1 Ordinary Least Squares (Linear Regression) 10
2.1.2 Ridge Regression 11
2.1.3 Support Vector Regressor 11
2.1.4 Bayesian Ridge 11
2.1.5 Lasso 12
2.1.6 Lasso Lars 12
2.1.7 ARD Regression 12
2.1.8 Passive Aggressive Regressor 12
2.1.9 Theil Sen Regressor 12
2.1.10 K-Neighbors Regressor 13
2.1.11 Logistic Regression 13

3 Methodology 15
3.1 Dataset 15
3.2 IHDP dataset 16
3.3 Other articles metrics 17

4 Experiments 19
4.1 Machine learning methods applied to IHDP dataset 20
4.2 Other experiments 27

4.2.1 Recursive Feature Elimination 27
4.2.2 Domain Adaptation Neural Networks 27

4.3 Discussion 27


5 Conclusions 29
5.1 Concluding Remarks 29
5.2 Future work 30

Bibliography 31


List of Tables

4.1 IHDP 10 replications with traditional machine learning algorithms - Within sample 20

4.2 IHDP 10 replications with traditional machine learning algorithms - Out-of-sample 21

4.3 IHDP 100 replications - Within sample 21
4.4 IHDP 100 replications - Out-of-sample 21
4.5 IHDP 100 replications already split dataset - Within sample 22
4.6 IHDP 100 replications already split dataset - Out-of-sample 22
4.7 IHDP 100 replications - No scaling - Within sample 23
4.8 IHDP 1000 replications - No Scaling - Out-of-sample 23
4.9 IHDP 100 replications - Scaled - Within sample 23
4.10 IHDP 1000 replications - No Scaling - Out-of-sample 24
4.11 IHDP 100 replications logistic regressions - Within sample 24
4.12 IHDP 100 replications logistic regressions - Out-of-sample 24
4.13 IHDP 100 replications SVR Hyper-parameters tuning - Within sample 25
4.14 IHDP 100 replications SVR Hyper-parameters tuning - Out-of-sample 25
4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag 2017) 26
4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag 2017) 26
4.17 Domain Adaptation Neural Networks 27


List of Abbreviations

ML Machine Learning
SVR Support Vector Regressor
RL Reinforcement Learning
NN Neural Networks
LR Linear Regression
KNN K Nearest Neighbours
RCE Randomized Controlled Experiment
ITE Individual Treatment Effect
ATE Average Treatment Effect
PEHE Precision in Estimation of Heterogeneous Effects
CATE Conditional Average Treatment Effect
RCM Rubin Causal Model


Chapter 1

Introduction

1.1 Motivation

Causality is often confused with correlation, but correlation does not imply causation. Inferences drawn from such spurious correlations often confuse the process by which humans make decisions.

The scientific community has still not agreed on a common definition of causality.

The proven scientific way to make claims about cause and effect is to perform what is called a Randomized Controlled Trial (RCT). In a Randomized Controlled Trial, a statistically representative portion of the population participating in the experiment (trial) is exposed to a treatment (action), which can be either positive - applying the treatment - or neutral (control) - giving the patient a placebo or not treating the patient (unit) at all.

All these concepts are described with medical vocabulary, since the field in which RCTs are applied the most is medical trials. However, medicine is not the only industry in which this concept of drawing conclusions from a trial can be used. For example, it is widely used in social studies, but it can also be applied to decisions on buying, selling or holding a particular stock, or to choosing the advertisement that generates more sales than the others in the advertising industry.

Although Randomized Controlled Trials are the best way to detect causal effects, performing such scientifically rigorous experiments is most of the time either impossible or unethical. An example can be seen when trying to detect whether driving under the effects of alcohol affects (or not) the driver's skills. Another clear example is determining the effects of smoking in teenagers or young people, for which performing an RCT would involve taking two groups of non-smoking teenagers, making half of the units smoke for several years - depending on the experiment or the research question - and then determining whether smoking in young people caused something or not after that period. As the reader can infer, there are clear ethical problems associated with performing full RCTs to determine causes and effects.

It is also important to make clear that there are cases in which previously collected data has to be analyzed even though the RCT method could not be followed. These cases are known as observational data. Observational data is defined as information obtained from previously collected situations in which a formal randomized trial method has not been applied, but from which it is still important to try to determine causes and effects. This is the case in most organizations at present, since they have possibly been collecting massive amounts of data during the last decades, but they could not, or would not, establish an RCT process while collecting the metrics. Moreover, sometimes the data collection process happens in a non-randomized, uncontrolled experiment. For example, a not-so-common disease that affects just a small percentage of the population might happen to appear within a wide range of people, which makes the inference process difficult.

In the past, causal inference methods have been a statisticians-only field. However, with recent advances in machine learning algorithms, more computer scientists and machine learning engineers have been trying to infer causes and relationships through traditional and new machine learning techniques.

The ability of some machine learning algorithms to learn complex non-linear relationships has been used to try to detect and predict policies in which, given the particular features of an individual (patient), the algorithm could determine whether or not to apply the treatment (action) to them. This concept is known as Individual Treatment Effect estimation; it can also be referred to as Policy Risk when predicting a binary class - whether or not to apply the treatment - through the discovery of a certain threshold.

It is a matter of interest to be able to predict the individual (customized) treatment effect, because this would lead to better decisions (actions or treatments) specifically shaped for each person, instead of relying only on the average over the whole studied population.

The ultimate motivation of this work is to be able to predict the Individual Treatment Effects for new patients from the previously collected data, using machine learning techniques alternative to the ones used in past research efforts. Moreover, a compilation of the literature that can be understood by computer scientists is attempted, and running code for all the experiments performed is released for others to build upon.

1.2 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE) for a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program), a semi-synthetic dataset that is particularly unbalanced and was created for the task of causal inference on observational data (Hill 2011).

The research question is whether alternative machine learning methods - without adding extra complexity, custom error losses or custom metric functions while learning and predicting - can obtain similar or even better results than the state-of-the-art metrics based on the exact same benchmark dataset.


1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes Framework (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation will be used throughout this whole thesis. This model is also known as the Neyman-Rubin causal model, and it is an approach to the statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn (User guide: contents — scikit-learn 0.19.2 documentation) Python framework has been used. All its underlying methods and default hyper-parameters have been used. Also, the mathematical notation of its documentation will be presented to describe each algorithm's functions and limitations.
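As a concrete illustration, the estimators listed in Section 2.1 can be instantiated with scikit-learn defaults as in the sketch below. This is not the thesis's actual experiment code: the dictionary keys are illustrative short names, and Logistic Regression, being a classifier, is treated separately.

```python
# The regressors compared in this thesis, instantiated with scikit-learn
# default hyper-parameters (a sketch; exact defaults vary by version).
from sklearn.linear_model import (ARDRegression, BayesianRidge, Lasso,
                                  LassoLars, LinearRegression,
                                  PassiveAggressiveRegressor, Ridge,
                                  TheilSenRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

regressors = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(),
    "SVR": SVR(),
    "BayesianRidge": BayesianRidge(),
    "Lasso": Lasso(),
    "LassoLars": LassoLars(),
    "ARD": ARDRegression(),
    "PassiveAggressive": PassiveAggressiveRegressor(),
    "TheilSen": TheilSenRegressor(),
    "KNN": KNeighborsRegressor(),
}
print(sorted(regressors))
```

Keeping the defaults means the comparison reflects each algorithm's inductive bias rather than tuning effort.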

The different algorithms have been tested on a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill 2011) as a semi-synthetic dataset based on real features obtained from a real observational study (Gross 1993). Replications of this dataset have been created to obtain 10, 100 and 1000 cases, to be able to train the machine learning models, predict on them, and obtain the desired metric error results afterwards.

It is important to notice that the testing method and the metrics used to determine the effectiveness of each algorithm differ from the ones normally used to test machine learning algorithms; testing these algorithms within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset has been used, the real Individual Treatment Effect is available to compute testing metrics. Therefore, the experiment results will consist of the performance of each algorithm in terms of the Individual Treatment Effect (ITE), Average Treatment Effect (ATE) and Precision in Estimation of Heterogeneous Effect (PEHE). These three are the metrics displayed in the Experiments chapter for each algorithm trained. A detailed explanation of these formulas can be found in section 2.0.3.

Also, it is important to notice that the machine learning algorithms are trained using just the applied (observed) treatment, the features (covariates in the causal inference literature) and the observed outcome (usually known as Y or Y factual). After training, a completely unseen dataset is used for testing purposes.

The already trained algorithm predicts the outcome based on the unit's (also known as patient's) features (covariates) for the case in which the unit would have taken the treatment, and likewise predicts the outcome as if the unit had taken the control treatment. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE and PEHE metrics are calculated. In addition, an average score and its deviation over the 10, 100 and 1000 replications of IHDP have been computed to evaluate these errors in bigger simulated scenarios.
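The train/predict-both-outcomes loop described above can be sketched as follows. This is an illustrative example only: random data stands in for the IHDP covariates, a single Ridge regressor stands in for the full set of algorithms, and the synthetic outcome model (constant true ITE of 4) is an assumption made for the demo.

```python
# Sketch: fit on (covariates, applied treatment) -> factual outcome,
# then predict both potential outcomes for every unit.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
n, d = 200, 25                                 # IHDP has 25 covariates
X = rng.normal(size=(n, d))
t = rng.binomial(1, 0.3, size=n)               # unbalanced treatment assignment
y0 = X[:, 0] + rng.normal(scale=0.1, size=n)   # synthetic control outcome
y1 = y0 + 4.0                                  # synthetic treated outcome
y_factual = np.where(t == 1, y1, y0)           # only this is observed in practice

model = Ridge().fit(np.column_stack([X, t]), y_factual)

# Predict the outcome under t = 1 and under t = 0 for every unit.
y1_hat = model.predict(np.column_stack([X, np.ones(n)]))
y0_hat = model.predict(np.column_stack([X, np.zeros(n)]))

ite_hat = y1_hat - y0_hat
ate_error = abs(ite_hat.mean() - (y1 - y0).mean())
pehe = np.mean(((y1 - y0) - ite_hat) ** 2)     # PEHE as defined in section 2.0.3
print(round(ate_error, 3), round(pehe, 3))
```

On real IHDP replications the same loop runs once per replication, and the mean and deviation of each error are reported.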

The mathematical notation will be kept to a minimum so as not to confuse the reader with unnecessary information.


1.4 Scope and Limitation

In this document, the outcomes of the treatment applied to a patient will be analyzed with respect to only two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments nor in the developed code, but both can easily be extended to cover these cases.

A binary treatment is applied, but its outcome value is continuous. This differs from the most commonly used case of four possible scenarios, in which usually just two can be observed or measured. All the experiments and the code developed can be applied to discrete outputs, but other machine learning techniques could be more suitable for this type of prediction (classification algorithms). Also, there are cases in which the task is to predict whether or not to apply the treatment to an individual (also known as unit or patient). Prediction in these cases turns into a classification task, in which a threshold on the confidence of predicting to affirmatively apply the treatment is usually set and validated through trial and error against several continuous values, to determine the one that predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

This case is more similar to real-world scenarios, where the data was observed and finally a decision on applying the treatment (action) or not has to be made in order to pursue a desired result.

In this work, the cases in which the dataset contains outcomes in a binary form, to predict whether or not to apply the treatment, will not be covered.


Chapter 2

Background

2.0.1 Neyman-Rubin Causal Model

The Rubin causal model (RCM) (Rubin 2005), also known as the Neyman-Rubin Potential Outcomes framework, is a statistical framework for modeling observational data developed by Donald Rubin. He built it on top of the original method that Jerzy Neyman introduced in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Neyman-Rubin potential outcomes framework considers units

x_i ∈ X

with an effectively applied treatment

t_i ∈ {0, 1}.

The two possible potential outcomes are defined by

Y_0(x_i), Y_1(x_i) ∈ Y.

Only one of them (the one which actually happened) can be observed as the factual outcome

y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i)

Let (x_1, t_1, y_1^F), ..., (x_n, t_n, y_n^F) be a sample from the factual distribution, and let (x_1, 1 − t_1, y_1^CF), ..., (x_n, 1 − t_n, y_n^CF) be the corresponding counterfactual sample.

Notice that all the factual outcomes y^F are known, whereas the counterfactual outcomes y^CF are never known for any unit (except in the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions y^F and y_ft will be used interchangeably to refer to factual (observed) outcomes, while y^CF and y_cft refer to counterfactual outcomes.
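The factual/counterfactual bookkeeping above can be made concrete with a few lines of code. The numbers are toy values (not IHDP data); only y_f would be observable in practice, while y_cf is known here only because the data is synthetic.

```python
# Toy illustration of factual vs. counterfactual outcomes.
import numpy as np

t  = np.array([1, 0, 0, 1])          # applied treatments
y0 = np.array([2.0, 2.5, 3.0, 2.5])  # potential outcomes under control
y1 = np.array([5.0, 4.0, 6.0, 5.5])  # potential outcomes under treatment

y_f  = t * y1 + (1 - t) * y0         # y_i^F = t_i Y1(x_i) + (1 - t_i) Y0(x_i)
y_cf = (1 - t) * y1 + t * y0         # the never-observed counterfactual sample
print(y_f.tolist(), y_cf.tolist())   # [5.0, 2.5, 3.0, 5.5] [2.0, 4.0, 6.0, 2.5]
```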


2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x and assigning either treatment t = 1 or t = 0 to that unit, it is impossible to observe the counterfactual outcome E[Y_0 | x, t = 1] or E[Y_1 | x, t = 0] (what would have happened, or what the outcome would have been, if the other treatment had been given to the unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y_0 | x, t = 0] or E[Y_1 | x, t = 1], or in shorter terms Y_0 or Y_1.

In this dissertation the focus is on the case in which the causal graph is simple and known to be of the form (Y_1, Y_0) ← x → t, with no hidden confounders.

This problem, like most of the problems and applications discussed under the Neyman-Rubin Potential Outcomes Framework in this dissertation, can be extended to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome y^CF is even worse in multi-treatment experiments, since the missing values that matter for better Individual Treatment Effect estimates grow with the total number of possible treatments except the one applied.

2.0.3 Metrics for Causality

Three well-known metrics from the causality field are reported for each machine learning technique applied.

The losses that will be reported are

• ε_ITE: Error of the Individual Treatment Effect - also known as the Conditional Average Treatment Effect (CATE) - which measures how well or how badly the treatment performs on one particular patient:

ITE(x) := E[y | X = x, t = 1] − E[y | X = x, t = 0] = E[Y_1(x) − Y_0(x)]

• ε_ATE: Error of the Average Treatment Effect; as its name describes, it represents the effect that the applied treatment, either t = 0 or t = 1, effectively had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, since their unique characteristics as a unit might make them experience wrong results or no results at all:

ATE := E[ITE(x)] = E[δ] = E[Y_1 − Y_0], ∀x ∈ X

• ε_PEHE: Precision in Estimation of Heterogeneous Effect, used to measure the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or not as accurate, for the other:

ε_PEHE := (1/N) Σ_{i=1}^{N} ((y_i1 − y_i0) − (ŷ_i1 − ŷ_i0))²
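Since the benchmark is semi-synthetic, both potential outcomes are known and the metrics above can be computed directly. A minimal sketch with illustrative numbers follows (ε_PEHE uses the squared form given above; some papers report its square root instead):

```python
# Compute the ATE error and PEHE from true and predicted potential
# outcomes; possible only because both y0 and y1 are known, as in a
# semi-synthetic benchmark such as IHDP.  Numbers are illustrative.
import numpy as np

y1 = np.array([5.0, 4.0, 6.0, 5.5])      # true outcomes under treatment
y0 = np.array([2.0, 2.5, 3.0, 2.5])      # true outcomes under control
y1_hat = np.array([4.8, 4.1, 5.7, 5.6])  # predicted outcomes, t = 1
y0_hat = np.array([2.1, 2.4, 3.2, 2.4])  # predicted outcomes, t = 0

ite_true = y1 - y0                       # per-unit individual treatment effect
ite_hat = y1_hat - y0_hat
eps_ate = abs(ite_hat.mean() - ite_true.mean())
eps_pehe = np.mean((ite_true - ite_hat) ** 2)  # squared form, as defined above
print(round(eps_ate, 3), round(eps_pehe, 3))
```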


2.0.4 Assumptions

To work on the results, three important assumptions under the Neyman-Rubin causal framework shall be made:

• Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y_0 will be the observed (factual) outcome y^F; if the applied treatment was t = 1, then y = Y_1 will be the available observed (factual) outcome y^F.

• Strong Ignorability: also known as no unmeasured confounders, this assumption can be stated as (Y_1, Y_0) ⫫ t | x and 0 < p(t = 1 | x) < 1 for all x. It is important to notice that, to be able to state this assumption, a domain knowledge expert would have to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: this assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 | x) < 1
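Common support can be probed empirically by estimating the propensity score p(t = 1 | x), for instance with a logistic regression, and checking that the estimates stay strictly inside (0, 1). The sketch below uses synthetic data and is a heuristic check, not a proof of the assumption:

```python
# Estimate propensity scores and check that every unit has a non-zero
# estimated chance of receiving either treatment (overlap heuristic).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
X = rng.normal(size=(500, 5))
p_true = 1.0 / (1.0 + np.exp(-X[:, 0]))  # treatment depends on x only
t = rng.binomial(1, p_true)

propensity = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
overlap_ok = (propensity.min() > 0.0) and (propensity.max() < 1.0)
print(overlap_ok)
```

Estimated propensities piling up near 0 or 1 would signal regions of the covariate space where counterfactual prediction is unreliable.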

2.0.5 Definitions

In causal inference from observational data several terms are used interchangeablyand might confuse the reader

This subsection should be clear before going further into this dissertation

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input; x_i, x_i ∈ X.

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML); x, x ∈ X.

• treatment: the possible actions that can be applied to a unit. Usually binary, but it can be multi-valued under the Neyman-Rubin Potential Outcomes Framework. Synonym: action; t, t ∈ {0, 1} or t ∈ {0, ..., N}.

• outcome: the measured result of applying a treatment t to a unit x. Synonyms: observed outcome, result, factual, Y factual; y_f = y^F.

• counterfactual: what the result would have been if the treatment opposite to the effectively applied one had been applied to a unit. Synonym: unobserved outcome; y_cft, y_cf, Y^CF.

2.0.6 Related Work

Potential outcomes are the framework to mathematically describe causality and coun-terfactuals (Rubin 1978)


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar 2016; Bottou et al. 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship 2014; Chernozhukov et al. 2016), whereas in machine learning the attention was caught less than a decade ago (Lang 1995; Bottou et al. 2013; Swaminathan and Joachims 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al. 2010; Triantafillou and Tsamardinos 2015; Mooij et al. 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods model the relationships between features, action pairs and rewards with one or more parameters, trying to specifically model the relations between context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice 1976; Gelman and Hill 2007), random forests (Wager and Athey 2015) and regression trees (Chipman, George and McCulloch 2010) have been used in the past to complete the task. For example, (Wager and Athey 2017) estimate ITEs with causal forests, but their asymptotic estimates in datasets with a large number of relevant features have limitations that need to be addressed in future work.

In non-parametric approaches, counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan 2016; Austin 2011; Rosenbaum and Rubin 1983; Rosenbaum 2002). Doubly robust causality is performed by merging parametric and non-parametric methods (Dudik, Langford and Li 2011; Jiang and Li 2015).

Double robust methods are known for merging the characteristics of both methodsA common example of this application would be propensity score weighted regres-sion (Bang and Robins 2005 Dudik Langford and Li 2011) When the treatmentassignment probability is known this method models the problem particularly welleg in off-policy evaluation or learning from bandits However in most of the casesin observational data their efficiency drops dramatically (Kang and Schafer 2007)

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions, as well as the application of other techniques from causality, with special focus on datasets with unbalanced treatment assignment. This is the sub-area of causality known as causal inference from observational data. Observational data is data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance-learning metrics and custom loss functions applied to neural networks has brought interesting

advances to the scientific community (Shalit, Johansson and Sontag 2017; Johansson, Shalit and Sontag 2016a). (Tian et al. 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag 2016a; Shalit, Johansson and Sontag 2017; Alaa, Weisz and Van Der Schaar 2017) have made important contributions, whereas for Policy Optimization the work of (Swaminathan and Joachims 2015a; Swaminathan and Joachims 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; see, to name a few, (Wager and Athey 2015), (Athey and Imbens 2016) and (Shalit and Sontag 2016). (Johansson, Shalit and Sontag 2016a) and (Shalit, Johansson and Sontag 2017) worked on learning balanced representations, using neural networks both to learn better predictions of the factual outcome and to minimize the error between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag 2017) the authors built on (Johansson, Shalit and Sontag 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work was performed with Gaussian processes (Alaa, Weisz and Van Der Schaar 2017) and with decision trees in different approaches (Hill 2011; Athey and Imbens 2016; Wager and Athey 2015).

Similarly, (Atan et al. 2016) face the problem of learning from biased data with many features by performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning the features that are relevant for predicting some actions while not taking them into account for others. The relevant-feature learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), as in (Tekin and Van Der Schaar 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and do not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims 2015a) came up with a Counterfactual Risk Minimization (CRM) method in which they minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. Later, (Atan, Zame and Van Der Schaar 2018) proposed to address the selection bias by learning representations, work closely related to the domain adaptation bounds in (Ben-David et al. 2007; Blitzer, McDonald and Pereira 2006). Additional techniques for policy optimization were developed by (Beygelzimer and Langford 2008), in which the propensity scores need to be known, solving the selection bias through rejection sampling. The algorithm that (Atan, Zame and Van Der Schaar 2018) introduce is based on domain adaptation (DA), as in (Gan et al. 2016). More work in the DA field was done by (Zhang et al. 2013; Daumé 2009).


To conclude, in cause-and-effect analysis Time Series data is widely adopted for decision-making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions, algorithms can learn rules to make decisions along time-steps (Nahum-Shani et al. 2012). An estimator based on structural nested models was introduced by (Lok 2008). Furthermore, the authors of "Causal Reasoning from Longitudinal Data" used Bayesian posterior predictive distributions to solve this time-series causality task. Later, (Schulam and Saria 2017) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the expected reward of a policy that was set beforehand (Dudik, Langford and Li 2011; Swaminathan and Joachims 2015a; Jiang and Li 2015; Paduraru et al. 2012; Doroudi, Thomas and Brunskill 2017).

2.1 Machine Learning

In this section, the machine learning techniques applied in the experiments, all implemented with the scikit-learn open source framework, are described.

The vast majority of the methods tested belong to the family of Generalized Linear Models, in which the target (label) value is represented as a linear combination of the covariates (inputs):

y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p        (2.1)

where the vector w = (w_1, ..., w_p) represents the coefficients and w_0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between the targets observed in the dataset and the predictions made on it.

Mathematically, it solves the problem:

min_w ||Xw - y||_2^2

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the coefficient estimates have high variance, and the model is therefore more sensitive to random errors in the predictions. This limitation especially affects data collected without a previously shaped experimental design.
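As a concrete illustration (on synthetic data, not the thesis experiments), a minimal scikit-learn fit that recovers known coefficients from a noiseless linear target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                      # covariates
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 3.0                      # noiseless linear target, intercept 3

model = LinearRegression().fit(X, y)
# model.coef_ recovers w_true and model.intercept_ recovers 3.0
```

With noiseless data the residual sum of squares is driven to zero, so the true parameters are recovered up to numerical precision.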


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the Linear Regression method above by penalizing the size of the coefficients. The loss turns into a penalized sum of squares to be minimized:

min_w ||Xw - y||_2^2 + α ||w||_2^2

It is worth mentioning that the parameter α ≥ 0 controls the amount of shrinkage, and therefore how robust to collinearity the trained model will be.
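A small sketch of this shrinkage behaviour on near-collinear synthetic data (the array shapes and α values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
base = rng.rand(50, 2)
# third column nearly duplicates the first -> approximate linear dependence
X = np.hstack([base, base[:, :1] + 0.01 * rng.rand(50, 1)])
y = X.sum(axis=1) + 0.1 * rng.randn(50)

# larger alpha -> stronger shrinkage of the coefficient vector
norms = {a: np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in (0.01, 1.0, 100.0)}
```

The norm of the coefficient vector is non-increasing in α, which is what tames the variance blow-up caused by the collinear columns.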

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification to regression problems. During the training phase, the best solution is the one penalized least in total by the loss function. The support vectors are the training inputs that either fall outside the ε-tube around the fitted function or lie exactly on its edge; they alone determine the function used for future predictions.

In particular, an SVR takes the training vectors x_i ∈ R^p, i = 1, ..., n, and a vector y ∈ R^n. ε-SVR solves the following primal problem:

min_{w, b, ζ, ζ*}  (1/2) w^T w + C Σ_{i=1}^{n} (ζ_i + ζ_i*)

where C > 0 is the penalty parameter and ζ_i, ζ_i* are slack variables. In the dual formulation, e is the vector of all ones and Q is an n × n positive semidefinite matrix with Q_ij ≡ K(x_i, x_j) = φ(x_i)^T φ(x_j) the kernel. Here training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is:

Σ_{i=1}^{n} (α_i - α_i*) K(x_i, x) + ρ
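A sketch of ε-SVR in scikit-learn on a synthetic one-dimensional target (the kernel and parameter values here are illustrative, not the ones tuned later in this work):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)   # single feature in [0, 5]
y = np.sin(X).ravel()                      # smooth target

# epsilon defines the tube within which errors are not penalized;
# C trades off flatness against tolerance to points outside the tube
svr = SVR(kernel="rbf", C=1e3, gamma=1.0, epsilon=0.01).fit(X, y)
pred = svr.predict(X)
```

Only the points on or outside the ε-tube become support vectors (`svr.support_vectors_`), so sparser models result from a wider tube.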

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than ordinary Linear Regression.

This technique builds a probabilistic model of the regression problem in which the prior for the parameter w of the general Bayesian regression solver is a spherical Gaussian:

p(w|λ) = N(w | 0, λ^{-1} I_p)

The scikit-learn defaults are used to train the model: α_1 = α_2 = λ_1 = λ_2 = 10^{-6}.

During the model fitting process, the parameters w, α and λ are estimated jointly.
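A minimal sketch on synthetic data of how scikit-learn estimates the two precisions during fit (data and coefficients are illustrative):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.05 * rng.randn(100)

# alpha_1, alpha_2, lambda_1, lambda_2 keep their 1e-6 defaults; the noise
# precision (br.alpha_) and the weight precision (br.lambda_) are then
# estimated from the data jointly with the coefficients w (br.coef_)
br = BayesianRidge().fit(X, y)
```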


2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

min_w  (1 / (2 n_samples)) ||Xw - y||_2^2 + α ||w||_1

This method minimizes the least-squares objective with the penalty α||w||_1 added, where α is a constant and ||w||_1 is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm yields sparse models, which may be helpful for feature selection.
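A short sketch of the sparsity property on synthetic data (the α value and feature counts are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)  # only 2 informative features

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of features with non-zero weight
```

The ℓ1 penalty drives the coefficients of the eight uninformative columns exactly to zero, so `selected` acts as a feature-selection mask.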

2.1.6 Lasso Lars

This model is a Lasso fitted with the Least Angle Regression (LARS) algorithm; ℓ1 regularization is applied.

The objective function is determined by

(1 / (2 n_samples)) ||y - Xw||_2^2 + α ||w||_1

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w.

It drops the assumption of the Gaussian prior being spherical, making it elliptical instead.

Mathematically:

p(w|λ) = N(w | 0, A^{-1})  with  diag(A) = λ = {λ_1, ..., λ_p}

2.1.8 Passive Aggressive Regressor

Passive Aggressive algorithms are suitable for large-scale learning; they do not require a learning rate, but they do require a regularization parameter C.

The regressor can be used with two different loss functions: epsilon-insensitive (PA-I) or squared epsilon-insensitive (PA-II).

2.1.9 Theil Sen Regressor

It is especially suited to data with multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems; in that regime the method becomes similar to Ordinary Least Squares Linear Regression.


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the targets of the n nearest neighbors found during the training phase. It is important to notice that n is defined by the user and will affect, positively or negatively, the quality of the predictions.
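A minimal illustration of the effect of n_neighbors on synthetic data (the values of n are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.rand(60, 1), axis=0)
y = 4.0 * X.ravel() + 0.1 * rng.randn(60)

# n_neighbors=1 memorises the training targets; larger values smooth them
knn1 = KNeighborsRegressor(n_neighbors=1).fit(X, y)
knn20 = KNeighborsRegressor(n_neighbors=20).fit(X, y)
```

With n = 1 the training predictions reproduce the (noisy) targets exactly, while n = 20 averages the noise away at the cost of bias near the boundaries, which is why the choice of n can help or hurt.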

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can produce predictions for more than two classes using the multinomial formulation of the logistic function.

The scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets; the results are discussed in the Experiments chapter.
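A sketch of fitting the two solvers used later with the L2 penalty, on synthetic classification data (dataset parameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# both solvers support the multinomial loss with L2 regularization
scores = {}
for solver in ("newton-cg", "lbfgs"):
    clf = LogisticRegression(penalty="l2", solver=solver, max_iter=1000).fit(X, y)
    scores[solver] = clf.score(X, y)
```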


Chapter 3

Methodology

This chapter details what this work set out to achieve and explains the methods that were used, together with any other information needed to help the reader follow the experiments covered later.

In addition, the dataset that was used is presented, closing with a section about other possible datasets and the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of the recent efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients or inputs (in the machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Since the collection was not properly randomized, the observational data does not come from a single probability distribution, and the number of units that received the treatment could differ substantially from the number that did not.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they would be unethical or impossible to perform. For example, an experiment testing whether driving under the effects of alcohol is dangerous for the driver or pedestrians, with a control treatment of driving without alcohol consumption, would be completely unethical for clear reasons.

To address these limitations when working on causal effects in observational data, synthetic, semi-synthetic or toy datasets are created by researchers to establish a good starting point and a benchmark framework for developing and testing algorithms that make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions in causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In such cases a policy risk function π is designed to apply, or not, the treatment t depending on a certain threshold θ; the main goal when iterating over different values of this threshold is to minimize the errors made when predicting whether to apply the active treatment or the control.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment conditions to reduce the developmental and health problems of low-birth-weight premature infants. The treated group received home visits and attendance at a dedicated child development center in addition to a pediatric follow-up, which can be described as high-quality child care; the control group received only the pediatric follow-up.

(Hill 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross 1993) mentioned in the paragraph above. Hill selected continuous and binary covariates from this real-life RCT and, using 25 of the study's covariates, generated non-parametric simulated outcomes for the whole population of the trial. The author then introduced an artificial imbalance between control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units), of which 608 did not receive the treatment (control) and 139 were treated. As can clearly be noticed, the dataset ends up quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 and t = 1 based on the generalization an algorithm can achieve.

Along with the covariates, the simulated causal information can be observed for each unit: the treatment effectively applied (t = 0 or t = 1), the observed factual outcome (y_F), the counterfactual outcome (y_CF), and the noiseless average outcomes mu0 and mu1.
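Given the fields above, the ground-truth individual and average treatment effects of a replication follow directly from the noiseless outcomes. A sketch with synthetic stand-in arrays (the real values come from the IHDP files; only the unit count of 747 is taken from the text, and the effect size here is an arbitrary assumption):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 747                               # units in one IHDP replication
mu0 = rng.randn(n)                    # noiseless outcome under control (stand-in)
mu1 = mu0 + 4.0 + rng.randn(n)        # noiseless outcome under treatment (stand-in)

true_ite = mu1 - mu0                  # per-unit individual treatment effect
true_ate = float(true_ite.mean())     # average treatment effect
```

These ground-truth quantities are used only for computing the evaluation errors, never for training.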

In this dissertation, 100 and 1000 replications of the original (Hill 2011) dataset were used for hyperparameter selection and evaluation, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie 2016). The 100 and 1000 replications were downloaded from (Johansson 2017 (accessed July 19 2018)) and are the exact same files used in (Shalit, Johansson and Sontag 2017; Louizos et al. 2017), the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE and PEHE errors, for the reader who wishes to look further into them. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit and Sontag 2016b) the IHDP dataset (Hill 2011) is run on 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie 2016), selecting the log-linear response surface implemented as setting "B" in that tool. Those results are not shown in this dissertation since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al. 2017; Shalit, Johansson and Sontag 2017). To implement the BART results, (Johansson, Shalit and Sontag 2016b) relied on Bayesian Additive Regression Trees (Chipman, George and McCulloch 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill 2011). However, the authors do not state the number of replications used to gather the metrics, nor whether a log-linear response surface (A, B, or any other method) was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with those of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset was performed to ultimately predict the factual (y_F) and counterfactual (y_CF) outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al. 2017), in which the ε_ITE, ε_ATE and ε_PEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome y_CF nor the noiseless average outcomes mu0, mu1 can be used at all to train the regression models. Instead, these three values, along with the factual outcome y_F, are used only to compute the ε_ITE, ε_ATE and ε_PEHE errors.

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The replications were downloaded from (Johansson 2017 (accessed July 19 2018)) and are the same used to produce the results in (Shalit, Johansson and Sontag 2017; Louizos et al. 2017), whose tables with their state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson and Sontag 2017), the same technique later followed by (Louizos et al. 2017) to perform, compare and show their results.

Within-sample: this refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples for which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for an individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All these are common problems of observational data and were mentioned in the Background chapter.


Out-of-sample: these predictions are made on completely unseen units, held out of the training and validation phases. Making predictions is naturally harder in this case, since the inputs might come from probability distributions different even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, and the ITE, ATE and PEHE errors are then determined.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0, and subsequently predictions are made setting all the values of the treatment to t = 1. The difference between these two predictions for each input is the ITE, and will ultimately define whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by E[Y_1 - Y_0 | x].

The machine learning algorithms, implemented in Python with scikit-learn (User guide contents - scikit-learn 0.19.2 documentation), were run with the default hyperparameters to obtain the above-mentioned metrics, shown in the tables displayed in this chapter. The hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
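The two-prediction procedure and the error computation can be sketched as follows, with a synthetic stand-in for one replication (the response surface and treatment assignment here are illustrative assumptions, not the actual IHDP data; the SVR settings mirror those selected later in this chapter):

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in: x, t, yf are "observed"; mu0/mu1 are the noiseless
# outcomes used only for evaluation, never for training.
rng = np.random.RandomState(0)
n, p = 747, 25
x = rng.randn(n, p)
mu0 = x[:, 0]                                 # assumed control surface
mu1 = x[:, 0] + 4.0                           # assumed treated surface
t = rng.binomial(1, 0.2, n)                   # unbalanced treatment assignment
yf = np.where(t == 1, mu1, mu0) + 0.1 * rng.randn(n)

# Train a single regressor on [covariates, treatment] -> factual outcome
model = SVR(kernel="rbf", C=1e3, gamma=0.01).fit(np.c_[x, t], yf)

# Predict both potential outcomes for every unit and take the difference
y0_hat = model.predict(np.c_[x, np.zeros(n)])
y1_hat = model.predict(np.c_[x, np.ones(n)])
ite_hat = y1_hat - y0_hat                     # estimate of E[Y1 - Y0 | x]

true_ite = mu1 - mu0
eps_ate = abs(ite_hat.mean() - true_ite.mean())
eps_pehe = np.sqrt(np.mean((ite_hat - true_ite) ** 2))
```

The same predictions, fed into the evaluation code of (Louizos et al. 2017), yield the ε_ITE, ε_ATE and ε_PEHE values reported in the tables.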

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                   ε_ITE          ε_ATE          √ε_PEHE
Support Vector Regressor (SVR)     2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                      3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                          4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                              4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                      3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor         4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                  3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                   5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor               5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                   3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show the results for 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case, it is worth remarking that the split between training and testing was performed randomly over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                   ε_ITE          ε_ATE          √ε_PEHE
Support Vector Regressor (SVR)     2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                      3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                          4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                              4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                      3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor         4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                  3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                   4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor               4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                   3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained in the following experiments, with the datasets already split into train and test as downloaded from (Johansson 2017 (accessed July 19 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                   ε_ITE          ε_ATE          √ε_PEHE
Support Vector Regressor (SVR)     3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                      4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                          4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                              4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                      4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor         5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                  4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                   5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor               5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                   4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                   ε_ITE          ε_ATE          √ε_PEHE
Support Vector Regressor (SVR)     2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                      4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                          4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                              4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                      4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor         5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                  4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                   4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor               4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                   4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson 2017 (accessed July 19 2018)), which is the exact same dataset used in (Louizos et al. 2017; Shalit, Johansson and Sontag 2017). As mentioned in this thesis, hyperparameter tuning, if any, was performed on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

                                   ε_ITE          ε_ATE          √ε_PEHE
Support Vector Regressor (SVR)     3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                      4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                          4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                              4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                      4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor         5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                  4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                   5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor               5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                   4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

                                   ε_ITE          ε_ATE          √ε_PEHE
Support Vector Regressor (SVR)     2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                      4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                          4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                              4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                      4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor         4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                  4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                   4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor               4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                   4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson and Sontag 2017; Louizos et al. 2017). The same semi-synthetic IHDP dataset by (Hill 2011), with the log-linear response setting A generated using the code from (Dorie 2016), was used for both types of measures.

Four tables are shown for the 1000 replications, in pairs. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair (Tables 4.9 and 4.10) was obtained by scaling the features to [0, 1] with the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling in the final results of the methods presented in the following section.
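A sketch of the [0, 1] scaling combined with the best-performing regressor, wrapped in a pipeline so the scaler is fit on the training data only (synthetic data; the pipeline wrapper is an assumption, the thesis only states that MinMaxScaler was applied):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(100, 5) * 100.0          # raw features on an arbitrary scale
y = X[:, 0] / 50.0 + 0.1 * rng.randn(100)

# MinMaxScaler maps each feature to [0, 1] using the min/max seen at fit time,
# and replays the same transformation on unseen units at predict time
model = make_pipeline(MinMaxScaler(feature_range=(0, 1)),
                      SVR(kernel="rbf", C=1e3, gamma=0.01))
model.fit(X, y)
pred = model.predict(X)
```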



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                   ε_ITE          ε_ATE          √ε_PEHE
Support Vector Regressor (SVR)     3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                      4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                          4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                              4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                      4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor         5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                  4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                   5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor               5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                   4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                   ε_ITE          ε_ATE          √ε_PEHE
Support Vector Regressor (SVR)     2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                      4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                          4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                              4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                      4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor         5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                  4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                   4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor               4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                   4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                   ε_ITE          ε_ATE          √ε_PEHE
Support Vector Regressor (SVR)     2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                      4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                          4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                              4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                      4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor         5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                  4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                   4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor               4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                    4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                     4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                   4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                   ε_ITE          ε_ATE          √ε_PEHE
Support Vector Regressor (SVR)     2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                      4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                          4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                              4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                      4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor         5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                  4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                   4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor               4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                    4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                     4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                   4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Consequently, a Logistic Regression with a multinomial multi-class predictor was applied. Its performance is well below that of the regressors, the main reason being that the encoding which assigns each target value a class probability does not match the continuous values that need to be predicted, and further precision is lost when decoding the predictions. The l2 norm was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
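The encode/decode scheme just described can be sketched as follows. The toy data, the ten quantile classes and the bin-midpoint decoding are illustrative assumptions, not the thesis's exact encoding; the real experiments use the IHDP covariates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-in for the covariates and a continuous outcome.
X = rng.normal(size=(500, 25))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=500)

# Discretise the continuous target so a classifier can be used at all:
# this encoding/decoding step is where precision is lost.
bins = np.quantile(y, np.linspace(0, 1, 11)[1:-1])  # 9 inner edges -> 10 classes
y_cls = np.digitize(y, bins)

for solver in ("newton-cg", "lbfgs"):
    clf = LogisticRegression(solver=solver, penalty="l2", max_iter=1000)
    clf.fit(X, y_cls)
    # Decode class predictions back to a continuous value via bin midpoints.
    edges = np.concatenate(([y.min()], bins, [y.max()]))
    midpoints = (edges[:-1] + edges[1:]) / 2.0
    y_hat = midpoints[clf.predict(X)]
    print(solver, "MAE:", np.abs(y_hat - y).mean())
```

The mean absolute error stays noticeably above what a plain regressor achieves on the same data, which mirrors the gap reported in Tables 4.11 and 4.12.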

TABLE 4.11 IHDP 100 replications, logistic regressions - Within sample

                                      εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)   7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)       7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12 IHDP 100 replications, logistic regressions - Out-of-sample

                                      εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)   5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)       5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were performed. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed both within sample and out-of-sample, but on 100 replications of the dataset; this is the same procedure the compared authors (Shalit, Johansson and Sontag, 2017; Johansson, Shalit and Sontag, 2016a; Louizos et al., 2017) state they use for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyper-parameter search.
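The sweep over kernels and gamma values can be sketched as below. The stand-in data and the plain held-out MSE score are assumptions for illustration; the actual experiments score each IHDP replication with the causal error metrics instead.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Hypothetical stand-in data (the thesis runs this sweep per IHDP replication).
X = rng.normal(size=(300, 25))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

best = None
for gamma in (0.1, 0.05, 0.01, 0.001):
    model = SVR(kernel="rbf", C=1e3, gamma=gamma)
    model.fit(X[:200], y[:200])                      # train split
    mse = np.mean((model.predict(X[200:]) - y[200:]) ** 2)  # held-out error
    if best is None or mse < best[1]:
        best = (gamma, mse)
print("best gamma:", best[0])
```

The same loop extends directly to the polynomial kernel and larger C values listed in Tables 4.13 and 4.14.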

TABLE 4.13 IHDP 100 replications, SVR hyper-parameter tuning - Within sample

                         εITE          εATE          √εPEHE
SVR-rbf-1e3-g01          3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g005         2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g001         2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0001        3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g00001       4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g000001      4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g01         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g01         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g01         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2     2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1     2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4     2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2    2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14 IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

                         εITE          εATE          √εPEHE
SVR-rbf-1e3-g01          2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g005         2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g001         2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0001        3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g00001       4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g000001      4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g01         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g01         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g01         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2     2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1     2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4     2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2    3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39

Finally, the results obtained by the experiments run for this thesis are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results reported in the publication (Shalit, Johansson and Sontag, 2017).


TABLE 4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)

             √εPEHE       εATE
OLS/LR-1     5.8 ± .3     .73 ± .04
OLS/LR-2     2.4 ± .1     .14 ± .01
BLR          5.8 ± .3     .72 ± .04
k-NN         2.1 ± .1     .14 ± .01
TMLE         5.0 ± .2     .30 ± .01
BART         2.1 ± .1     .23 ± .01
RAND FOR     4.2 ± .2     .73 ± .05
CAUS FOR     3.8 ± .2     .18 ± .01
BNN          2.2 ± .1     .37 ± .03
TARNET       .88 ± .0     .26 ± .01
CFR MMD      .73 ± .0     .30 ± .01
CFR WASS     .71 ± .0     .25 ± .01

Within sample IHDP 1000 replications

TABLE 4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)

             √εPEHE       εATE
OLS/LR-1     5.8 ± .3     .94 ± .06
OLS/LR-2     2.5 ± .1     .31 ± .02
BLR          5.8 ± .3     .93 ± .05
k-NN         4.1 ± .2     .79 ± .05
BART         2.3 ± .1     .34 ± .02
RAND FOR     6.6 ± .3     .96 ± .06
CAUS FOR     3.8 ± .2     .40 ± .03
BNN          2.1 ± .1     .42 ± .03
TARNET       .95 ± .0     .28 ± .01
CFR MMD      .78 ± .0     .31 ± .01
CFR WASS     .76 ± .0     .27 ± .01

Out-of-sample IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section presented publications implementing powerful feature selection methods, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn library.

In addition, assuming strong ignorability on the studied dataset, it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, eliminating some of those features might relieve part of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; they are therefore not shown here, but can be reviewed in the code implementation for further analysis.
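A minimal sketch of how RFE is wired up with scikit-learn follows. The toy data with an artificially duplicated column is an assumption for illustration; the actual experiment uses the IHDP covariates and the regressors listed earlier.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
# Make two columns highly correlated, as discussed above (hypothetical data).
X[:, 9] = X[:, 0] + 0.01 * rng.normal(size=200)
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Recursively drop the weakest feature until only five remain.
selector = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
print("kept features:", np.flatnonzero(selector.support_))
```

`selector.transform(X)` then yields the reduced design matrix that would be fed to the downstream regressors.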

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network (DANN) implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for the 10 replications; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel dual-core i7.

Domain Adaptation algorithms are a promising direction for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17 Domain Adaptation Neural Networks

                       εITE          εATE          √εPEHE
DANN (Within-sample)   1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)   1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and Out-of-sample IHDP 10 replications
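The gradient-reversal idea behind DANN can be illustrated with a minimal NumPy sketch, assuming a linear feature extractor and treating the control/treated groups as source/target domains. All data, dimensions and hyper-parameters here are illustrative, not those of the executed implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical source/target data with a covariate shift, mimicking the
# control/treated "domains" in counterfactual inference.
n, d, k = 200, 5, 3
Xs = rng.normal(size=(n, d))           # source units (e.g. control)
Xt = rng.normal(loc=1.0, size=(n, d))  # target units (e.g. treated)
ys = Xs @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=n)

W = 0.1 * rng.normal(size=(d, k))  # shared feature extractor
v = np.zeros(k)                    # label head (regression)
u = np.zeros(k)                    # domain head (logistic)
lr, lam = 0.05, 0.1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

mse0 = None
for step in range(300):
    zs, zt = Xs @ W, Xt @ W
    # Label loss, computed on source data only.
    err = zs @ v - ys
    if mse0 is None:
        mse0 = float(np.mean(err ** 2))
    grad_v = zs.T @ err / n
    grad_W_label = Xs.T @ (err[:, None] * v) / n
    # Domain loss: the head learns to tell source (0) from target (1)...
    z_all = np.vstack([zs, zt])
    d_all = np.concatenate([np.zeros(n), np.ones(n)])
    g = sigmoid(z_all @ u) - d_all
    grad_u = z_all.T @ g / (2 * n)
    # ...but the gradient is REVERSED (sign-flipped, scaled by lam) before
    # reaching the extractor, pushing W toward domain-invariant features.
    grad_W_dom = np.vstack([Xs, Xt]).T @ (g[:, None] * u) / (2 * n)
    W -= lr * (grad_W_label - lam * grad_W_dom)
    v -= lr * grad_v
    u -= lr * grad_u

print("label MSE:", mse0, "->", float(np.mean((Xs @ W @ v - ys) ** 2)))
```

The actual implementation uses deep networks and trains the heads jointly, but the update rule for the shared layers follows this same reversed-gradient pattern.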

4.3 Discussion

As can be clearly noticed, the machine learning regression algorithms applied in this dissertation perform very closely to the methods published by the cited authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treated dataset, and no other custom loss function were needed to obtain the results shown in Table 4.10 and Table 4.9.
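For clarity, the reported quantities can be computed directly from true and predicted potential outcomes. In this sketch the data are synthetic stand-ins, and εITE is taken as a mean absolute error, which is an assumption here; the exact conventions follow the definitions in Chapter 2.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

# Hypothetical noiseless potential outcomes under control (mu0) and
# treatment (mu1), plus imperfect model predictions of each.
mu0, mu1 = rng.normal(size=n), rng.normal(loc=4.0, size=n)
mu0_hat = mu0 + 0.3 * rng.normal(size=n)
mu1_hat = mu1 + 0.3 * rng.normal(size=n)

ite_true = mu1 - mu0
ite_pred = mu1_hat - mu0_hat

eps_ite = np.mean(np.abs(ite_pred - ite_true))       # εITE (absolute error)
eps_ate = np.abs(ite_pred.mean() - ite_true.mean())  # εATE
pehe = np.sqrt(np.mean((ite_pred - ite_true) ** 2))  # √εPEHE
print(eps_ite, eps_ate, pehe)
```

Note that εATE can never exceed √εPEHE (the absolute mean error is bounded by the root mean squared error), which is why a method can have a good ATE while still estimating individual effects poorly.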


It seems that an excessive amount of effort by the authors leads to complicated methods without gaining much more in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or of the treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the Domain Adaptation Neural Network training and testing errors on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying machine learning regressors out-of-the-box, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate, custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, although its focus shifted once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the deep learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal problems from observational data, involving more than two possible outcomes, that are substantially more suitable for Reinforcement Learning algorithms than for any deep neural network or regressor.

Finally, this work is intended to cover a considerable gap in straightforward definitions for applying machine learning to causality. Although several notable papers have been published in the last two years, they are difficult to follow when relating terms from the causal inference field to researchers with a computer and data science background. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future work on this topic should take several directions.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome should be extended to multi-valued treatments. To the best of my knowledge, this modification should not be costly to make in the code; however, at least one new dataset supporting this kind of treatment arity would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it achieved in the experiment run here.

Lastly, these methods could be applied to causal datasets whose outcomes vary with time and with the treatments applied, framed as time-series problems in continuous space. Such datasets will possibly be the next focus of researchers working on machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz and Mihaela Van Der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame and Mihaela Van Der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06). Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames", pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162

Hoiles, William and Mihaela Van Der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6

Rosenbaum, Paul R. (2002). "Observational Studies", pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Complete Draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018)

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.



UNIVERSITY OF ESSEX

Abstract

Faculty Name

School of Computer Science and Electronic Engineering

Master of Science in Artificial Intelligence

Machine Learning for causal Inference on Observational Data

by Hernán E. BORRÉ

The established scientific way to make claims about cause and effect is to perform a Randomized Controlled Trial (RCT). However, although RCTs are the best way to determine causal effects, performing such rigorous scientific experiments is most often either impossible or unethical. The Average Treatment Effect (ATE) is usually the outcome of an RCT, and this outcome is ideally proof of an effect in the studied population, which hopefully extends to other individuals. In contrast, it is most common to find observational data, in which the collected data might be heavily unbalanced with respect to treatment assignment, or the patients' covariates might come from completely different distributions. Nevertheless, the ultimate goal of causal effect estimation is to find the specific Individual Treatment Effect (ITE) for each patient. Identifying the Individual Treatment Effect is a topic that has always been important in the field of causality, especially within the machine learning community.

Applications of such predictions are related to medicine, but they extend to financial investments, advertisement placement, recommender systems for retail, the social sciences and beyond.

The ability of some machine learning algorithms to learn complex non-linear relationships has been exploited to detect and predict policies in which, given the particular features of an individual (patient), the algorithms determine whether or not to apply the treatment.

In this thesis, the ITE is predicted using a benchmark semi-synthetic dataset which has been unbalanced. Assuming strong ignorability, alternative machine learning techniques that had not been tested in past publications are applied to predict the ITE from observational data. The results obtained are compared with state-of-the-art outcomes; some of the algorithms applied in this work performed similarly to more complex, custom-designed methods.

In addition, a full review of the recent literature on machine learning applied to causal inference has been carried out.


Acknowledgements

First and foremost, I would like to thank my supervisor, Dr. Spyros Samothrakis. He has been an essential pillar in the whole process of this dissertation and my career, giving me not only academic and professional support but also encouraging me to make the most out of this process.

Second, I would like to thank my parents, who have always listened to my issues, periods of stress and emotional crises, and cheered me up every time I needed them. Without them I would not be here writing this dissertation. They made me believe that everything is possible in life if you try hard enough and are an honest person. Infinitely grateful to them, forever.

Third, to my former university professors from Universidad Tecnológica Nacional, Facultad Regional Buenos Aires, Dr. Oscar Bruno, Dr. Alejandro Prince and Dra. María Florencia Pollo-Cataneo, for their recommendation letters, support through the year and enlightenment in my professional career, always giving me the best advice I could get.

Fourth, I would like to thank Dr. Uri Shalit, who offered me immediate help and advice on this dissertation's topic, and who also helped me with the full IHDP dataset collection metrics for benchmark comparisons.

Fifth, to all my classmates (some of them friends now), who spent countless hours with me discussing our passion: making the world a better place through Machine Learning and Artificial Intelligence.

Last but not least, I would like to thank the Government of Argentina and the Argentinian Ministry of Education for giving me the chance to come to study at one of the best universities in the world through the BEC.AR scholarship.


Contents

Declaration of Authorship iii

Abstract vii

Acknowledgements ix

1 Introduction 1
  1.1 Motivation 1
  1.2 Purpose and Research Question 2
  1.3 Approach and Methodology 3
  1.4 Scope and Limitation 4

2 Background 5
  2.0.1 Rubin-Newman Causal Model 5
  2.0.2 The fundamental problem of causal analysis 6
  2.0.3 Metrics for Causality 6
  2.0.4 Assumptions 7
  2.0.5 Definitions 7
  2.0.6 Related Work 7
  2.1 Machine Learning 10
    2.1.1 Ordinary Least Squares (Linear Regression) 10
    2.1.2 Ridge Regression 11
    2.1.3 Support Vector Regressor 11
    2.1.4 Bayesian Ridge 11
    2.1.5 Lasso 12
    2.1.6 Lasso Lars 12
    2.1.7 ARD Regression 12
    2.1.8 Passive Aggressive Regressor 12
    2.1.9 Theil Sen Regressor 12
    2.1.10 K-Neighbors Regressor 13
    2.1.11 Logistic Regression 13

3 Methodology 15
  3.1 Dataset 15
  3.2 IHDP dataset 16
  3.3 Other articles metrics 17

4 Experiments 19
  4.1 Machine learning methods applied to IHDP dataset 20
  4.2 Other experiments 27
    4.2.1 Recursive Feature Elimination 27
    4.2.2 Domain Adaptation Neural Networks 27
  4.3 Discussion 27

5 Conclusions 29
  5.1 Concluding Remarks 29
  5.2 Future work 30

Bibliography 31


List of Tables

4.1  IHDP 10 replications with traditional machine learning algorithms - Within sample 20
4.2  IHDP 10 replications with traditional machine learning algorithms - Out-of-sample 21
4.3  IHDP 100 replications - Within sample 21
4.4  IHDP 100 replications - Out-of-sample 21
4.5  IHDP 100 replications, already split dataset - Within sample 22
4.6  IHDP 100 replications, already split dataset - Out-of-sample 22
4.7  IHDP 100 replications - No scaling - Within sample 23
4.8  IHDP 1000 replications - No Scaling - Out-of-sample 23
4.9  IHDP 100 replications - Scaled - Within sample 23
4.10 IHDP 1000 replications - No Scaling - Out-of-sample 24
4.11 IHDP 100 replications, logistic regressions - Within sample 24
4.12 IHDP 100 replications, logistic regressions - Out-of-sample 24
4.13 IHDP 100 replications, SVR hyper-parameter tuning - Within sample 25
4.14 IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample 25
4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017) 26
4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017) 26
4.17 Domain Adaptation Neural Networks 27


List of Abbreviations

ML    Machine Learning
SVR   Support Vector Regressor
RL    Reinforcement Learning
NN    Neural Networks
LR    Linear Regression
KNN   K Nearest Neighbours
RCE   Randomized Controlled Experiment
ITE   Individual Treatment Effect
ATE   Average Treatment Effect
PEHE  Precision in Estimation of Heterogeneous Effects
CATE  Conditional Average Treatment Effect
RCM   Rubin Causal Model


Chapter 1

Introduction

1.1 Motivation

Causality is often confused with correlation Correlation does not imply causationThese inferences are often called spurious correlations and they often confuses theinference process in which humans make decisions

A common definition of Causality is still not agreed by the scientific communitynowadays

The proven scientific way to make claims about cause and effect is to perform what is called a Randomized Controlled Trial (RCT). In an RCT, a statistically representative portion of the population participating in the experiment (trial) is exposed to a treatment (action), which can be either active (applying the treatment) or neutral (control), such as giving the patient a placebo or not treating the patient (unit) at all.

All these concepts are described in medical terms because the field in which RCTs are applied the most is medical trials. However, medicine is not the only industry in which conclusions can be drawn from a trial. RCTs are widely used in social studies, for example, but can also support decisions on buying, selling or holding a particular stock, or on displaying the advertisement that generates more sales than the others in the advertising industry.

Although Randomized Controlled Trials are the best way to detect causal effects, performing such scientifically rigorous experiments is most of the time either impossible or unethical. An example can be seen when trying to detect whether driving under the effects of alcohol affects the driver's skills. Another clear example is determining the causes of smoking in teenagers or young people, for which an RCT would involve taking two groups of non-smoking teenagers, making half of the units smoke for several years (depending on the experiment or the research question), and then determining what smoking caused in them after that period. As the reader can infer, there are clear ethical problems associated with performing full RCTs to determine causes and effects.

It is also important to make clear that there are cases in which previously collected data must be analyzed even though the RCT method could not be followed. Such cases involve what is known as observational data. Observational data is defined as information obtained from previously collected situations in which a formal


randomized trial method was not applied, but from which it is still important to try to determine causes and effects. This is the situation in most organizations today: they may have been collecting massive amounts of data during the last decades, but they could not, or did not, establish an RCT process while the metrics were collected. Moreover, sometimes the data collection happens in a non-randomized, uncontrolled setting. For example, a relatively rare disease that affects only a small percentage of the population might appear across such a wide range of people that the inference process becomes difficult.

In the past, causal inference methods were a statisticians-only field. However, with recent advances in machine learning algorithms, more computer scientists and machine learning engineers have been trying to infer causes and relationships through traditional and new machine learning techniques.

The ability of some machine learning algorithms to learn complex non-linear relationships has been used to detect and predict policies in which, given the particular features of an individual (patient), the algorithm determines whether or not to apply the treatment (action) to them. This concept is known as Individual Treatment Effect estimation; it can also be referred to as Policy Risk when predicting a binary class, i.e. whether or not to apply the treatment, through the discovery of a certain threshold.

Predicting the individual (customized) treatment effect is of interest because it leads to better decisions (actions or treatments) specifically shaped for each person, instead of relying only on the average over the whole studied population.

The ultimate motivation of this work is to predict Individual Treatment Effects for new patients from previously collected data, using machine learning techniques that are alternatives to the ones used in past research efforts. Moreover, this work attempts to compile a literature review that can be understood by more computer scientists, and running code for all the experiments performed will be released for others to build upon.

1.2 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE) on a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program): a semi-synthetic, particularly unbalanced dataset created for the task of causal inference on observational data (Hill, 2011).

The research question is whether alternative machine learning methods, without adding extra complexity, custom error losses or custom metric functions while learning and predicting, can obtain similar or even better results than the state-of-the-art metrics on the exact same benchmark dataset.


1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes Framework (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation will be used throughout this whole thesis. This model, also known as the Neyman-Rubin causal model, is an approach to the statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn (User guide: contents, scikit-learn 0.19.2 documentation) Python framework has been used, with all its underlying methods and default hyper-parameters. The mathematical notation of its documentation is also used to describe each algorithm's functions and limitations.

The different algorithms were tested on a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill, 2011) as a semi-synthetic dataset based on real features obtained from an observational study (Gross, 1993). Replications of this dataset were created to obtain 10, 100 and 1000 cases, in order to train the machine learning models, predict with them, and compute the desired error metrics afterwards.

It is important to note that the testing method and the metrics used to determine the effectiveness of each algorithm differ from the ones normally used to test machine learning algorithms; testing these algorithms within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset is used, the real Individual Treatment Effect is available for computing test metrics. Therefore, the experiment results report the performance of each algorithm in terms of the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE). These are the three metrics displayed in the Experiments chapter for each algorithm trained. A detailed explanation of these formulas can be found in section 2.0.3.

It is also important to note that the machine learning algorithms are trained using just the applied (observed) treatment, the features (covariates in the causal inference literature) and the observed outcome (usually known as Y or Y factual). After training, a dataset completely unseen during training is used for testing purposes.

The trained algorithm predicts, based only on the unit's (also known as patient's) features (covariates), the outcome as if the unit had taken the treatment, and likewise the outcome as if the unit had taken the control treatment. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE and PEHE metrics are calculated. In addition, an average score and its deviation over the 10, 100 and 1000 replications of the IHDP are computed to evaluate these errors in bigger simulated scenarios.
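The procedure just described can be sketched with scikit-learn on a small synthetic dataset. The data, the choice of a linear model, and the simulated outcome function below are illustrative placeholders, not the IHDP data or the thesis's exact pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n, p = 500, 25                                   # units and covariates (IHDP uses 25 covariates)
X = rng.normal(size=(n, p))                      # covariates
t = rng.binomial(1, 0.3, size=n)                 # observed treatment assignment
y_f = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)  # simulated factual outcome

# Train on the covariates plus the treatment indicator, as described above.
model = LinearRegression().fit(np.column_stack([X, t]), y_f)

# Predict both potential outcomes for every unit: once as if treated, once as control.
y1_hat = model.predict(np.column_stack([X, np.ones(n)]))
y0_hat = model.predict(np.column_stack([X, np.zeros(n)]))

ite_hat = y1_hat - y0_hat        # estimated Individual Treatment Effect per unit
ate_hat = ite_hat.mean()         # estimated Average Treatment Effect
```

Since the true simulated effect here is 2.0, the estimated ATE should land close to it; in the actual experiments the same two-prediction trick is applied to a held-out replication.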

Mathematical notation will be kept to a minimum so as not to confuse the reader with unnecessary information.


1.4 Scope and Limitation

In this document, the outcomes of the treatment applied to a patient are analyzed with respect to only two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments nor in the developed code, but both can easily be extended to cover those cases.

The treatment applied is binary, but its outcome value is continuous, unlike the most commonly studied case of four possible scenarios in which usually just two can be observed or measured. All the experiments and the code developed can be applied to discrete outputs, although other machine learning techniques (classification algorithms) could be more suitable for that type of prediction. There are also cases in which the task is to predict whether or not to apply the treatment to an individual (also known as unit or patient). Prediction then turns into a classification task, in which a threshold on the confidence of predicting to affirmatively apply the treatment is set and validated through trial and error against several continuous values, to determine which one predicts with the best accuracy. This is called Policy Risk in the causal inference literature.
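A hedged sketch of this thresholding idea follows. The predicted effects and the "benefit" signal are fabricated random numbers for illustration only; the thesis's Policy Risk metric itself is not computed here, only the trial-and-error threshold search:

```python
import numpy as np

rng = np.random.RandomState(0)
ite_pred = rng.normal(loc=0.5, scale=1.0, size=200)    # hypothetical predicted effects
benefit = ite_pred + rng.normal(scale=0.2, size=200)   # hypothetical realized benefit

def policy_value(threshold):
    # Treat only the units whose predicted effect exceeds the threshold,
    # and measure the total realized benefit of that policy.
    treat = ite_pred > threshold
    return benefit[treat].sum()

# Trial-and-error search over candidate thresholds, as described above.
thresholds = np.linspace(-1.0, 2.0, 31)
best_threshold = max(thresholds, key=policy_value)
```

In this toy setting the best threshold should sit near zero, since treating every unit with a positive predicted effect maximizes the accumulated benefit.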

This case is closer to real-world scenarios, where the data was observed and finally a decision on applying the treatment (action) or not has to be made in order to pursue a desired result.

In this work, the cases in which the dataset contains outcomes in binary form, to predict whether or not to apply the treatment, are not covered.


Chapter 2

Background

2.0.1 Rubin-Neyman Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Rubin-Neyman Potential Outcomes framework, is an extended statistical analysis frame developed by Donald Rubin to model observational data. Rubin built the framework on top of the method that Neyman originally developed in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Rubin-Neyman potential outcomes framework considers units

$x_i \in X$

with an effectively applied treatment

$t_i \in \{0, 1\}$.

The two possible potential outcomes are defined by

$Y_0(x_i), Y_1(x_i) \in Y$.

Of one of them (the one which actually happened) we can observe its factual outcome:

$y_i^F = t_i Y_1(x_i) + (1 - t_i) Y_0(x_i)$

Let $(x_1, t_1, y_1^F), \ldots, (x_n, t_n, y_n^F)$ be the sample from the factual distribution, and consequently let $(x_1, 1 - t_1, y_1^{CF}), \ldots, (x_n, 1 - t_n, y_n^{CF})$ be the counterfactual sample.

Notice that all the factual outcomes $y^F$ are known, whereas the counterfactual outcomes $y^{CF}$ are never observed for any unit (except during the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions $y^F$ or $y_{ft}$ will be used interchangeably to refer to factual (observed) outcomes, while $y^{CF}$ or $y_{cft}$ refer to counterfactual outcomes.


2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit $x$ and assigning either the treatment $t = 1$ or $t = 0$ to that unit, it is impossible to observe the counterfactual outcome $E[Y_0 \mid x, t = 1]$ or $E[Y_1 \mid x, t = 0]$ (what would have happened, or what the outcome would have been, if the other treatment had been given to the unit $x$).

However, it is always possible to observe the outcome of the effectively applied treatment $t$, represented as $E[Y_0 \mid x, t = 0]$ or $E[Y_1 \mid x, t = 1]$, or in shorter terms $Y_0$ or $Y_1$.

In this dissertation the focus is on the case where the causal graph is simple and known to be of the form $(Y_1, Y_0) \leftarrow x \rightarrow t$, with no hidden confounders.

This problem, like most of the problems and applications discussed under the Rubin-Neyman Potential Outcomes Framework in this dissertation, can be extended to a multi-treatment experiment. It is important to note that the problem of not having access to the counterfactual outcome $y^{CF}$ is even worse in multi-treatment experiments, since the missing values that matter for better Individual Treatment Effect estimates grow with the number of possible treatments, all except the one applied.

2.0.3 Metrics for Causality

Three well-known metrics in the causality field are reported for each machine learning technique applied.

The losses that will be reported are

• $\epsilon_{ITE}$: Error of the Individual Treatment Effect, also known as the Conditional Average Treatment Effect (CATE); it measures how well or how badly the treatment performs on one particular patient.

$\mathrm{ITE}(x) := E[y \mid X = x, t = 1] - E[y \mid X = x, t = 0] = E[Y_{x1} - Y_{x0}]$

• $\epsilon_{ATE}$: Error of the Average Treatment Effect. As its name describes, it represents the effect that the applied treatment, either $t = 0$ or $t = 1$, effectively had over the whole population. Note that, being an average, it may not be the best guide for treating a new patient, whose unique characteristics as a unit might make them experience wrong results or no results at all.

$\mathrm{ATE} := E[\mathrm{ITE}(x)] = E[\delta] = E[Y_1 - Y_0] \quad \forall x \in X$

• $\epsilon_{PEHE}$: Precision in Estimation of Heterogeneous Effect, used to measure the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ITE and ATE predictions, penalizing predictions that are accurate for one measure but wrong, or not as accurate, for the other (here $\hat{y}_{i1}, \hat{y}_{i0}$ denote predicted outcomes):

$\epsilon_{PEHE} := \frac{1}{N} \sum_{i=1}^{N} \left( (\hat{y}_{i1} - \hat{y}_{i0}) - (y_{i1} - y_{i0}) \right)^2$
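With true and predicted potential outcomes available, as in the semi-synthetic setting used here, these metrics can be computed directly. A minimal sketch with made-up toy numbers:

```python
import numpy as np

def abs_ate_error(ite_true, ite_pred):
    # Absolute error on the Average Treatment Effect.
    return float(np.abs(ite_true.mean() - ite_pred.mean()))

def pehe(y1_true, y0_true, y1_pred, y0_pred):
    # Precision in Estimation of Heterogeneous Effect, as in the formula above.
    return float(np.mean(((y1_pred - y0_pred) - (y1_true - y0_true)) ** 2))

# Toy potential outcomes for three units (fabricated for illustration).
y1_true = np.array([3.0, 2.0, 4.0])
y0_true = np.array([1.0, 1.0, 1.0])
y1_pred = np.array([2.5, 2.5, 4.0])
y0_pred = np.array([1.0, 1.5, 1.0])

err_ate = abs_ate_error(y1_true - y0_true, y1_pred - y0_pred)
err_pehe = pehe(y1_true, y0_true, y1_pred, y0_pred)
```

Here the true ITEs are (2, 1, 3) and the predicted ITEs are (1.5, 1, 3), so the ATE error is 1/6 and the PEHE is 0.25/3.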


2.0.4 Assumptions

To work on the results, three important assumptions under the Rubin-Neyman causal framework shall be made:

• Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if $t = 0$, then $y = Y_0$ will be the observed (factual) outcome $y^F$; if instead the applied treatment was $t = 1$, then $y = Y_1$ will be the available observed (factual) outcome $y^F$.

• Strong Ignorability: also known as no unmeasured confounders, this assumption can be stated as $(Y_1, Y_0) \perp\!\!\!\perp t \mid x$ and $0 < p(t = 1 \mid x) < 1$ for all $x$. Note that, to be able to assert this assumption, a domain knowledge expert has to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: this assumption states that for each unit $x \in X$ there is a positive probability of being both treated ($t = 1$) and untreated ($t = 0$):

$0 < P(t = 1 \mid x) < 1$

2.0.5 Definitions

In causal inference from observational data, several terms are used interchangeably and might confuse the reader.

These terms should be clear before going further into this dissertation.

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input; $x_i \in X$.

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML); $x \in X$.

• treatment: the possible actions that can be applied to a unit, usually binary but possibly multi-valued under the Rubin-Neyman Potential Outcomes Framework. Synonym: action; $t \in \{0, 1\}$ or $t \in \{0, \ldots, N\}$.

• outcome: the measured result of applying a treatment $t$ to a unit $x$. Synonyms: observed outcome, result, factual, Y factual; $y_f = y^F$.

• counterfactual: what the result would have been if the treatment opposite to the one effectively applied had been given to the unit. Synonyms: unobserved outcome; $y_{cft}$, $y_{cf}$, $Y^{CF}$.

2.0.6 Related Work

Potential outcomes are the framework to mathematically describe causality and coun-terfactuals (Rubin 1978)


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning it has caught attention no less than a decade ago (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods perform causal inference by using one or more parameters to specifically model the relations among context (features), actions (treatments) and outcomes (rewards). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015) and regression trees (Chipman, George and McCulloch, 2010) have been used in the past. For example, (Wager and Athey, 2017) estimate ITEs with causal forests, but their asymptotic estimates have limitations in datasets with a large number of relevant features that need to be addressed in future work.

In non-parametric approaches, the counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causality is performed by merging parametric and non-parametric methods (Dudik, Langford and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both families. A common example of this approach is propensity score weighted regression (Bang and Robins, 2005; Dudik, Langford and Li, 2011). When the treatment assignment probability is known, this kind of method models the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational data settings their efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions (as well as the application of other techniques to causality) with a special focus on datasets with unbalanced treatment application. This refers to the sub-area known as causal inference from observational data: data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting


advances to the scientific community (Shalit, Johansson and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag, 2016a; Shalit, Johansson and Sontag, 2017; Alaa, Weisz and Van Der Schaar, 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that minimizes the risk of predicting the action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few, (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016) and (Shalit, Johansson and Sontag, 2017). (Johansson, Shalit and Sontag, 2016a) and (Shalit, Johansson and Sontag, 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error loss between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag, 2017) the authors built on (Johansson, Shalit and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work was performed with Gaussian processes (Alaa, Weisz and Van Der Schaar, 2017) and with decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with several features, performing feature selection while predicting among multiple possible actions (outcomes); this is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning features that are relevant for predicting some actions while not taking them into account for others. The relevant feature selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), as in (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and did not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims, 2015a) came up with a Counterfactual Risk Minimization (CRM) method in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, work closely related to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald and Pereira, 2006). Additional techniques on policy optimization were developed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work on DA techniques was done by (Zhang et al., 2013; Daumé, 2009).


To conclude, in cause and effect analysis Time Series data is widely adopted for decision making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator based on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time series causality task. Later in time, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that is set beforehand (Dudik, Langford and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas and Brunskill, 2017).

2.1 Machine Learning

This section describes the machine learning techniques, applied using the scikit-learn open source framework, that were used to perform the experiments.

The vast majority of the methods tested belong to the family of Generalized Linear Models, in which the target (label) value is represented as a linear combination of the covariates (inputs):

$\hat{y}(w, x) = w_0 + w_1 x_1 + \ldots + w_p x_p$ (2.1)

where the vector $w = (w_1, \ldots, w_p)$ represents the coefficients and $w_0$ is the intercept.
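The estimators described in the following subsections all share scikit-learn's fit/predict interface, so the experiments can iterate over them uniformly. A minimal sketch on a synthetic regression problem (the dataset and the in-sample R² scoring below are illustrative only, not the thesis's evaluation protocol):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import (ARDRegression, BayesianRidge, Lasso, LassoLars,
                                  LinearRegression, PassiveAggressiveRegressor,
                                  Ridge, TheilSenRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "SVR": SVR(),
    "BayesianRidge": BayesianRidge(),
    "Lasso": Lasso(alpha=0.1),
    "LassoLars": LassoLars(alpha=0.1),
    "ARD": ARDRegression(),
    "PassiveAggressive": PassiveAggressiveRegressor(random_state=0),
    "TheilSen": TheilSenRegressor(random_state=0),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

# Fit each model with default-style hyper-parameters and record its in-sample R^2.
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
```

In the actual experiments each fitted model is scored with the causal metrics of section 2.0.3 rather than R².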

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between theobserved dataset and the predictions made on it

Mathematically it solves the problem of

$\min_w \|Xw - y\|_2^2$

The main limitation of this method is that if the features (covariates) have an approximately linear dependence, the model produces high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the Linear Regression method described above by penalizing the size of the coefficients. The loss turns into the problem of minimizing the penalized sum of squares:

$\min_w \|Xw - y\|_2^2 + \alpha \|w\|_2^2$

It is worth mentioning that the parameter $\alpha \ge 0$ controls the amount of robustness to collinearity that the trained model will have.

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely spread Support Vector Machines for classification, used to solve regression problems. During the training phase, the best possible solution is the one that is penalized least in total by a loss function. The support vectors are the inputs that are either mispredicted, predicted within enough margin, or lie on the edges of the generated hyper-plane used for future predictions.

In particular, an SVR takes training vectors $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, and a vector $y \in \mathbb{R}^n$; $\varepsilon$-SVR then solves the following primal problem:

$\min_{w, b, \zeta, \zeta^*} \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)$

subject to $y_i - w^T \phi(x_i) - b \le \varepsilon + \zeta_i$, $w^T \phi(x_i) + b - y_i \le \varepsilon + \zeta_i^*$ and $\zeta_i, \zeta_i^* \ge 0$ for $i = 1, \ldots, n$.

Here $C > 0$ is the regularization upper bound and $Q$ is an $n$ by $n$ positive semidefinite matrix with $Q_{ij} \equiv K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ the kernel; training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.

The decision function is:

$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + \rho$
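The experiments also tune SVR hyper-parameters (Tables 4.13 and 4.14). A hedged sketch of how such tuning can be done with scikit-learn's GridSearchCV; the grid values, the synthetic data and the scaling pipeline here are illustrative assumptions, not the thesis's exact settings:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Standardizing the inputs matters for the RBF kernel used by default.
pipe = make_pipeline(StandardScaler(), SVR())
grid = GridSearchCV(pipe,
                    param_grid={"svr__C": [1, 10, 100],
                                "svr__epsilon": [0.01, 0.1, 1.0]},
                    cv=3)
grid.fit(X, y)
best_params = grid.best_params_   # best (C, epsilon) pair by cross-validated score
```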

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than Linear Regression.

This technique elaborates a probabilistic model of the regression problem, with the parameter $w$ of the general Bayesian Regression solver given a spherical Gaussian prior:

$p(w \mid \lambda) = \mathcal{N}(w \mid 0, \lambda^{-1} I_p)$

The scikit-learn defaults are used to train the model: $\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 = 10^{-6}$.

During the fitting of the model, the parameters $w$, $\alpha$ and $\lambda$ are estimated jointly.


2.1.5 Lasso

Lasso Regression is a linear model fitted with an $\ell_1$ prior as regularizer. Its objective is to minimize:

$\min_w \frac{1}{2 n_{\text{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1$

This method minimizes the least-squares penalty with $\alpha \|w\|_1$ added, where $\alpha$ is a constant and $\|w\|_1$ is the $\ell_1$-norm of the parameter vector.

It is important to note that this algorithm yields sparse models, which may be helpful for feature selection.
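A small sketch of this sparsity property on an illustrative synthetic dataset: only 3 of its 20 features are informative, so the fitted model should zero out most coefficients (the alpha value is an assumption for the example, not a tuned setting):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 20 features, but only 3 actually drive the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
n_selected = int(np.sum(lasso.coef_ != 0))   # features kept by the L1 penalty
```

Inspecting which coefficients survive is a simple form of feature selection.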

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with L1 regularization applied.

The objective function is determined by

$\frac{1}{2 n_{\text{samples}}} \|y - Xw\|_2^2 + \alpha \|w\|_1$

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge it may lead to sparser weights w

It drops the assumption that the Gaussian is spherical, making it elliptical instead.

Mathematically:

$p(w \mid \lambda) = \mathcal{N}(w \mid 0, A^{-1})$

with $\mathrm{diag}(A) = \lambda = \{\lambda_1, \ldots, \lambda_p\}$.

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, these models do not require a learning rate, but do require a regularization parameter $C$.

It can be used with two different loss functions: PA-I (epsilon insensitive) or PA-II (squared epsilon insensitive).

2.1.9 Theil Sen Regressor

It is especially suited for handling multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems. In high dimensions, this method becomes similar to a Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm, the target is predicted from the $n$ nearest neighbors found during the training phase. Note that $n$ is defined by the user and will positively or negatively affect the obtained prediction results.

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, and it can be extended to predictions over more than two classes using the multinomial logistic function.

This scikit-learn implementation can fit binary One-vs-Rest or multinomial logisticregression with optional L2 or L1 regularization

Several solvers and regularizations were applied to the datasets and the results willbe discussed in the Experiments section
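A brief sketch of these regularization options; the synthetic dataset and the solver choice are illustrative (the liblinear solver supports both L1 and L2 penalties):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# L2-regularized (scikit-learn's default penalty) vs L1-regularized fits.
l2 = LogisticRegression(penalty="l2", solver="liblinear").fit(X, y)
l1 = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)

acc_l2 = l2.score(X, y)                  # in-sample accuracy with L2
n_zeroed = int(np.sum(l1.coef_ == 0))    # coefficients zeroed by the L1 penalty
```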


Chapter 3

Methodology

This chapter details what this work set out to achieve. In addition, the methods used are explained, along with any other information needed to help the reader understand the flow of the experiments covered later.

It also presents how the dataset was used, closing with a section about other possible datasets that could be applied and possible limitations of the ones used in this dissertation.

Finally, full coverage of the dataset used to perform the experiments is given.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real lifescenarios are difficult to obtain

On the one hand, the whole point of the current project, and to some extent of the latest efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients or inputs (in machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Because of its nature, such already collected (observational) data has not been properly randomized, nor does it come from a single probability distribution. Also, the number of units which received the treatment and the number which did not could differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions, since they would be unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, tested against a control treatment (in this case, driving without alcohol consumption), is completely unethical for clear reasons.

To overcome these limitations when working on causal effects on observational data, synthetic, semi-synthetic or toy datasets are created by researchers, in order to establish a good starting point and a benchmark framework against which to try, test or develop better algorithms able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to note that there are two different kinds of predictions for causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases, a Policy Risk function $\pi$ is designed to apply the treatment $t$ or not, depending on a certain threshold $\theta$. Minimizing the errors made when predicting whether to apply the active treatment or the control is the main goal when iterating over different values of the threshold for the trained dataset.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment interventions to reduce the developmental and health problems caused by the low birth weight of premature infants. The treated group received home visits and integration at a dedicated child development center in addition to a pediatric follow-up, which can be described as high-quality child care, whereas the control group received only the pediatric follow-up.

Later, (Hill, 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the paragraph above. In (Hill, 2011), a number of continuous and binary covariates from this real-life RCT were selected, 25 in total, and used to generate non-parametric simulated outcomes for the whole population of the trial. The author then introduced an artificial imbalance between the control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 and t = 1 given the generalization task that an algorithm must perform.

Along with the covariates, each unit carries the simulated causal information: the treatment effectively applied (t = 0 or t = 1), the observed (factual) outcome yF, the counterfactual outcome yCF, and the noiseless average outcomes mu0 and mu1.
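For concreteness, here is a sketch of how one replication can be unpacked. The key names (`'x'`, `'t'`, `'yf'`, `'ycf'`, `'mu0'`, `'mu1'`) and the trailing replication axis follow my understanding of the downloaded `.npz` files and should be treated as assumptions; a synthetic stand-in with the same shapes is used so the snippet is self-contained.

```python
import numpy as np

def get_replication(data, i):
    """Extract replication i from arrays whose last axis indexes replications."""
    x = data['x'][:, :, i]    # covariates, shape (units, 25)
    t = data['t'][:, i]       # applied treatment, 0 or 1
    yf = data['yf'][:, i]     # factual (observed) outcome
    ycf = data['ycf'][:, i]   # counterfactual outcome (simulated)
    mu0 = data['mu0'][:, i]   # noiseless outcome under control
    mu1 = data['mu1'][:, i]   # noiseless outcome under treatment
    return x, t, yf, ycf, mu0, mu1

# synthetic stand-in: 747 units, 25 covariates, 10 replications
rng = np.random.default_rng(0)
data = {'x': rng.normal(size=(747, 25, 10)),
        't': rng.integers(0, 2, size=(747, 10)),
        'yf': rng.normal(size=(747, 10)),
        'ycf': rng.normal(size=(747, 10)),
        'mu0': rng.normal(size=(747, 10)),
        'mu1': rng.normal(size=(747, 10))}
x, t, yf, ycf, mu0, mu1 = get_replication(data, 0)
```

With the real files, `data` would come from `np.load(...)` on a downloaded replication file instead of the synthetic dictionary.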

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for hyperparameter selection and evaluation respectively, all generated with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

Other published articles evaluating ITE, ATE, and PEHE errors are worth mentioning for the reader who wishes to look further into them. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b), the IHDP dataset (Hill, 2011) was run on 100 replications for hyperparameter tuning and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in that tool. Those results are not shown in this dissertation, since the response surface chosen differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). To produce the BART results, (Johansson, Shalit, and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill, 2011). However, the authors do not make explicit the number of replications used to gather the metrics, nor do they state whether the log-linear response surface "A", "B", or any other method was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with the results of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset were performed to ultimately predict the factual yF and counterfactual yCF outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE, and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome yCF nor the noiseless average outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these values, along with the factual outcome yF, are only used to compute the εITE, εATE, and εPEHE errors.
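A sketch of how the three errors can be computed from predictions of both potential outcomes, loosely following the evaluation conventions described here (the "true" ITE for εITE is reconstructed from the factual and counterfactual outcomes, while εPEHE and εATE use the noiseless mu1 - mu0); the exact definitions in the code of (Louizos et al., 2017) may differ in details, so treat this as illustrative.

```python
import numpy as np

def causal_errors(y0_pred, y1_pred, t, yf, ycf, mu0, mu1):
    """Compute eps_ITE, eps_ATE and eps_PEHE for one replication (sketch)."""
    pred_ite = y1_pred - y0_pred
    # "true" per-unit effect reconstructed from factual/counterfactual outcomes
    true_ite_noisy = np.where(t == 1, yf - ycf, ycf - yf)
    eps_ite = np.sqrt(np.mean((true_ite_noisy - pred_ite) ** 2))
    eps_ate = np.abs(np.mean(mu1 - mu0) - np.mean(pred_ite))
    eps_pehe = np.sqrt(np.mean(((mu1 - mu0) - pred_ite) ** 2))
    return eps_ite, eps_ate, eps_pehe

# sanity check: a perfect, noise-free predictor makes all three errors vanish
mu0 = np.array([1.0, 2.0]); mu1 = np.array([2.0, 4.0])
t = np.array([1, 0])
yf = np.where(t == 1, mu1, mu0)
ycf = np.where(t == 1, mu0, mu1)
errs = causal_errors(mu0, mu1, t, yf, ycf, mu0, mu1)
```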

The experiments were run on 10, 100, and 1000 replications, both within-sample and out-of-sample. The replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same ones used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson, and Sontag, 2017), the same convention later followed by (Louizos et al., 2017) to perform, compare, and present their results.

Within-sample: this refers to all the errors (ITE, ATE, and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples for treatments t = 0 and t = 1), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population who received treatment t = 1 and the population who received t = 0 might come from completely different probability distributions. All of these are common problems of observational data and were mentioned earlier in this work.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. In this case it is naturally harder to make predictions, since the inputs may come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, and the ITE, ATE, and PEHE errors are then computed.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0, and then the predictions are repeated setting all the values of the treatment to t = 1. The subtraction between these two predictions for each input is known as the ITE and ultimately defines whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by E[Y1 − Y0 | x].

The machine learning algorithms, implemented in Python with (User guide: contents — scikit-learn 0.19.2 documentation), were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. Hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
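The two-pass prediction scheme just described can be sketched with a single scikit-learn regressor: the treatment indicator is appended as one more input feature, and the fitted model is queried twice per unit, once with t = 0 and once with t = 1. The data below is synthetic and the snippet is an illustration of the idea, not the thesis code itself.

```python
import numpy as np
from sklearn.svm import SVR

# synthetic stand-in for one replication: 200 units, 5 covariates, true ITE = 2
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 5))
t = rng.integers(0, 2, size=n)
y = x[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)

# single regressor trained on [covariates, treatment] -> factual outcome
model = SVR(kernel='rbf', C=1e3, gamma=0.01)   # hyperparameters reported later in the text
model.fit(np.column_stack([x, t]), y)

# query the model twice per unit: once under control, once under treatment
y0_pred = model.predict(np.column_stack([x, np.zeros(n)]))
y1_pred = model.predict(np.column_stack([x, np.ones(n)]))
ite_pred = y1_pred - y0_pred                   # per-unit estimate of E[Y1 - Y0 | x]
```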

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP, 10 replications, traditional machine learning algorithms - Within-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14   0.94 ± 0.35   2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99   0.97 ± 0.67   4.80 ± 2.80
LassoLars                         4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
Lasso                             4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
ARDRegression                     3.92 ± 2.01   0.97 ± 0.74   4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09   1.54 ± 1.07   4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99   0.89 ± 0.63   4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
LinearRegression                  3.92 ± 2.01   0.89 ± 0.65   4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is worth remarking that the split between training and testing was performed randomly over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP, 10 replications, traditional machine learning algorithms - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63   1.24 ± 0.62   2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67   1.82 ± 1.35   4.13 ± 2.23
LassoLars                         4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
Lasso                             4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
ARDRegression                     3.57 ± 1.70   1.83 ± 1.41   4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94   2.38 ± 1.75   4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68   1.76 ± 1.30   4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
LinearRegression                  3.58 ± 1.70   1.77 ± 1.33   4.09 ± 2.22

obtained for the following experiments, which use the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP, 100 replications - Within-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60   0.86 ± 0.16   5.65 ± 0.83
LassoLars                         4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
Lasso                             4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
ARDRegression                     4.49 ± 0.60   0.81 ± 0.16   5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75   0.83 ± 0.14   5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59   0.79 ± 0.15   5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
LinearRegression                  4.53 ± 0.60   0.79 ± 0.16   5.63 ± 0.83

TABLE 4.4: IHDP, 100 replications - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54   1.02 ± 0.26   5.37 ± 0.78
LassoLars                         4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
Lasso                             4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
ARDRegression                     4.27 ± 0.54   1.00 ± 0.26   5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69   1.00 ± 0.21   5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53   0.99 ± 0.25   5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
LinearRegression                  4.31 ± 0.55   0.99 ± 0.26   5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, hyperparameter tuning, where performed, was done on this number of replications.

TABLE 4.5: IHDP, 100 replications, already split dataset - Within-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38   0.76 ± 0.08   3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58   0.80 ± 0.14   5.61 ± 0.83
LassoLars                         4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
Lasso                             4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
ARDRegression                     4.45 ± 0.59   0.77 ± 0.15   5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62   0.83 ± 0.13   5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57   0.72 ± 0.13   5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
LinearRegression                  4.48 ± 0.59   0.75 ± 0.14   5.60 ± 0.82

TABLE 4.6: IHDP, 100 replications, already split dataset - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28   0.73 ± 0.07   3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57   0.81 ± 0.11   5.74 ± 0.89
LassoLars                         4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
Lasso                             4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
ARDRegression                     4.42 ± 0.58   0.78 ± 0.11   5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59   1.01 ± 0.17   5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56   0.85 ± 0.13   5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
LinearRegression                  4.45 ± 0.58   0.78 ± 0.11   5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting "B" generated using the code from (Dorie, 2016), was used for both types of measures.

Four different tables are given for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly but enough to keep the scaling for the final results of the methods presented in the following section.
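A minimal sketch of that scaling step on synthetic data, fitting the scaler on the training covariates only so that no test-set information leaks into training:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# synthetic stand-in for the covariate matrices
rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, scale=2.0, size=(100, 25))
x_test = rng.normal(loc=5.0, scale=2.0, size=(40, 25))

scaler = MinMaxScaler()                  # maps each training feature to [0, 1]
x_train_s = scaler.fit_transform(x_train)
x_test_s = scaler.transform(x_test)      # test values may fall slightly outside [0, 1]
```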



TABLE 4.7: IHDP, 1000 replications - No scaling - Within-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12   0.69 ± 0.02   3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19   0.78 ± 0.04   5.81 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22   0.90 ± 0.05   5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18   0.70 ± 0.03   5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26

TABLE 4.8: IHDP, 1000 replications - No scaling - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09   0.78 ± 0.03   3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19   0.98 ± 0.05   5.79 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22   1.13 ± 0.07   5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

TABLE 4.9: IHDP, 1000 replications - Scaled - Within-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08   0.33 ± 0.02   2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19   0.73 ± 0.04   5.80 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22   1.02 ± 0.06   5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.69 ± 0.03   5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20   1.64 ± 0.09   6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18   0.67 ± 0.03   5.79 ± 0.25
ElasticNet                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26


TABLE 4.10: IHDP, 1000 replications - Scaled - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is well below that of the regressors, the main reason being that when the continuous target values are encoded into classes to assign them probabilities, these classes are not the same values that need to be predicted; additionally, precision is lost when decoding the predictions. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP, 100 replications, logistic regressions - Within-sample

Method                                 εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)    7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)        7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP, 100 replications, logistic regressions - Out-of-sample

Method                                 εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)    5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)        5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were performed. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel, C=1e3, and gamma=0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method that the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyper-parameter search.

TABLE 4.13: IHDP, 100 replications, SVR hyper-parameter tuning - Within-sample

Configuration            εITE          εATE          √εPEHE
SVR-rbf-1e3-g01          3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g005         2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g001         2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0001        3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g00001       4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g000001      4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g01         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g01         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g01         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2     2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1     2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4     2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2    2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP, 100 replications, SVR hyper-parameter tuning - Out-of-sample

Configuration            εITE          εATE          √εPEHE
SVR-rbf-1e3-g01          2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g005         2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g001         2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0001        3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g00001       4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g000001      4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g01         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g01         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g01         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2     2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1     2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4     2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2    3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39

Finally, the results obtained by this thesis and the experiments run are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017). Within-sample, IHDP, 1000 replications.

Method       √εPEHE      εATE
OLS/LR-1     5.8 ± .3    .73 ± .04
OLS/LR-2     2.4 ± .1    .14 ± .01
BLR          5.8 ± .3    .72 ± .04
k-NN         2.1 ± .1    .14 ± .01
TMLE         5.0 ± .2    .30 ± .01
BART         2.1 ± .1    .23 ± .01
RANDFOR      4.2 ± .2    .73 ± .05
CAUSFOR      3.8 ± .2    .18 ± .01
BNN          2.2 ± .1    .37 ± .03
TARNET       .88 ± .0    .26 ± .01
CFR MMD      .73 ± .0    .30 ± .01
CFR WASS     .71 ± .0    .25 ± .01

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017). Out-of-sample, IHDP, 1000 replications.

Method       √εPEHE      εATE
OLS/LR-1     5.8 ± .3    .94 ± .06
OLS/LR-2     2.5 ± .1    .31 ± .02
BLR          5.8 ± .3    .93 ± .05
k-NN         4.1 ± .2    .79 ± .05
BART         2.3 ± .1    .34 ± .02
RANDFOR      6.6 ± .3    .96 ± .06
CAUSFOR      3.8 ± .2    .40 ± .03
BNN          2.1 ± .1    .42 ± .03
TARNET       .95 ± .0    .28 ± .01
CFR MMD      .78 ± .0    .31 ± .01
CFR WASS     .76 ± .0    .27 ± .01


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though powerful feature selection methods were presented in the Related Work section, an experiment was performed in the developed code that applies Recursive Feature Elimination (RFE) using the scikit-learn library.

In addition, under the Strong Ignorability assumption on the studied dataset, it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, RFE might relieve some of the errors they make.
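A minimal sketch of such an RFE step with scikit-learn, on synthetic data with 25 covariates as in IHDP; the choice of base estimator and of keeping 10 features are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# synthetic data: only covariates 0 and 3 actually drive the outcome
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 25))
y = x[:, 0] - 2.0 * x[:, 3] + rng.normal(scale=0.1, size=200)

# recursively drop the weakest covariates according to the linear model
selector = RFE(LinearRegression(), n_features_to_select=10)
selector.fit(x, y)
x_reduced = selector.transform(x)            # only the retained covariates
kept = np.flatnonzero(selector.support_)     # indices of kept covariates
```

The reduced matrix `x_reduced` would then be fed to the regressors in place of the full covariate set.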

The results were significantly worse across all algorithms for the causal inference metrics detailed in this work; they are therefore not shown here, but they can be reviewed by the reader in the code implementation for further analysis.

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for the 10 replications in this work; the uploaded code contains the straightforward implementation for the 1000 replications used in the other experiments.

The results, shown below in Table 4.17, are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel Dual Core i7.

Domain Adaptation algorithms are a promising direction for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks. Within-sample and out-of-sample, IHDP, 10 replications.

Run                      εITE          εATE          √εPEHE
DANN (Within-sample)     1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)     1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

4.3 Discussion

As can be clearly noticed, the machine learning regressor algorithms applied in this dissertation achieve results very close to those obtained in the work published by the cited and compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the unbalanced treatment assignment) or any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.


The considerable effort invested by those authors appears to lead to complicated methods without gaining much in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Networks on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state-of-the-art.

It has to be taken into account that no custom metric functions or special preprocessing steps (except for scaling the features, which is part of the must-do tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, though it shifted when the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable for this task than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal problems on observational data, with more than two possible treatments, that are substantially more suitable to solve with Reinforcement Learning algorithms than with any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow when relating terms from the causal inference field for researchers with a computer and data science background. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future directions for this work include the following approaches.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to multi-valued treatments. To the best of my knowledge, this modification should not be costly to perform in the code; however, at least one new dataset supporting this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold determining whether a treatment should be applied or not, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing with this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the precision, outperforming the state-of-the-art, that it showed in the experiment run here.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied over time frames them within time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers in machine learning applied to treatments administered over time.


Bibliography

Alaa Ahmed M Michael Weisz and Mihaela Van Der Schaar (2017) Deep Counter-factual Networks with Propensity-Dropout Tech rep arXiv arXiv170605966v1URL httpsarxivorgpdf170605966pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur, et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353-7360. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399-424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". In: Biometrics. DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai, et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06). Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094

Bottou, Léon, et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207-3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor, et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266-298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang, et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849-866. DOI: 10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217-240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML '16, pp. 1596-1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O., et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). Pisa, Italy: ACM, pp. 1199-1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803

Johansson, Fredrik D. (2017; accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523-539. ISSN: 0883-4237. DOI: 10.1214/07-STS227

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331-339. DOI: 10.1016/B978-1-55860-377-6.50048-7

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464-1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820

Louizos, Christos, et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2

Maathuis, Marloes H., et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247-248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247

Mooij, Joris M., et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1-102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991

Nahum-Shani, Inbal, et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478-494. ISSN: 1939-1463. DOI: 10.1037/a0029373

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748

Paduraru, Cosmin, et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89-101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393-1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1-17. DOI: 10.1007/978-1-4757-3692-2_1


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41-55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34-58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322-331. DOI: 10.1198/016214504000001880

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for "Estimating individual treatment effect: generalization bounds and algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (complete draft). URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu, et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517-1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147-2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018)

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352

Zhang, Kun, et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.



Acknowledgements

First and foremost, I would like to thank my supervisor, Dr. Spyros Samothrakis. He has been an essential pillar throughout this dissertation and my career, giving me not only academic and professional support but also encouraging me to make the most of this process.

Second, I would like to thank my parents, who have always listened to my issues, periods of stress, and emotional crises, and cheered me up every time I needed them. Without them I would not be here writing this dissertation. They made me believe that everything in life is possible if you try hard enough and are an honest person. I am infinitely grateful to them, forever.

Third, to my former university professors from Universidad Tecnológica Nacional, Facultad Regional Buenos Aires, Dr. Oscar Bruno, Dr. Alejandro Prince, and Dra. María Florencia Pollo-Cataneo, for their recommendation letters, their support throughout the year, and their guidance in my professional career, always giving me the best advice I could get.

Fourth, I would like to thank Dr. Uri Shalit, who offered immediate help and advice on this dissertation's topic, and who also helped me with the full IHDP dataset collection and the metrics for benchmark comparisons.

Fifth, to all my classmates (some of them friends now), who spent countless hours with me discussing our shared passion: making the world a better place through Machine Learning and Artificial Intelligence.

Last but not least, I would like to thank the Government of Argentina and the Argentinian Ministry of Education for giving me the chance to come and study at one of the best universities in the world through the BECAR scholarship.


Contents

Declaration of Authorship

Abstract

Acknowledgements

1 Introduction
   1.1 Motivation
   1.2 Purpose and Research Question
   1.3 Approach and Methodology
   1.4 Scope and Limitation

2 Background
   2.0.1 Rubin-Neyman Causal Model
   2.0.2 The fundamental problem of causal analysis
   2.0.3 Metrics for Causality
   2.0.4 Assumptions
   2.0.5 Definitions
   2.0.6 Related Work
   2.1 Machine Learning
      2.1.1 Ordinary Least Squares (Linear Regression)
      2.1.2 Ridge Regression
      2.1.3 Support Vector Regressor
      2.1.4 Bayesian Ridge
      2.1.5 Lasso
      2.1.6 Lasso Lars
      2.1.7 ARD Regression
      2.1.8 Passive Aggressive Regressor
      2.1.9 Theil Sen Regressor
      2.1.10 K-Neighbors Regressor
      2.1.11 Logistic Regression

3 Methodology
   3.1 Dataset
   3.2 IHDP dataset
   3.3 Other articles' metrics

4 Experiments
   4.1 Machine learning methods applied to IHDP dataset
   4.2 Other experiments
      4.2.1 Recursive Feature Elimination
      4.2.2 Domain Adaptation Neural Networks
   4.3 Discussion

5 Conclusions
   5.1 Concluding Remarks
   5.2 Future work

Bibliography


List of Tables

4.1 IHDP 10 replications with traditional machine learning algorithms - Within sample
4.2 IHDP 10 replications with traditional machine learning algorithms - Out-of-sample
4.3 IHDP 100 replications - Within sample
4.4 IHDP 100 replications - Out-of-sample
4.5 IHDP 100 replications, already split dataset - Within sample
4.6 IHDP 100 replications, already split dataset - Out-of-sample
4.7 IHDP 100 replications - No scaling - Within sample
4.8 IHDP 1000 replications - No scaling - Out-of-sample
4.9 IHDP 100 replications - Scaled - Within sample
4.10 IHDP 1000 replications - No scaling - Out-of-sample
4.11 IHDP 100 replications, logistic regressions - Within sample
4.12 IHDP 100 replications, logistic regressions - Out-of-sample
4.13 IHDP 100 replications, SVR hyper-parameter tuning - Within sample
4.14 IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample
4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag 2017)
4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag 2017)
4.17 Domain Adaptation Neural Networks


List of Abbreviations

ML - Machine Learning
SVR - Support Vector Regressor
RL - Reinforcement Learning
NN - Neural Networks
LR - Linear Regression
KNN - K Nearest Neighbours
RCE - Randomized Controlled Experiment
ITE - Individual Treatment Effect
ATE - Average Treatment Effect
PEHE - Precision in Estimation of Heterogeneous Effects
CATE - Conditional Average Treatment Effect
RCM - Rubin Causal Model


Chapter 1

Introduction

1.1 Motivation

Causality is often confused with correlation, yet correlation does not imply causation. Such misleading associations are known as spurious correlations, and they frequently distort the inferences on which humans base their decisions.

Even today, the scientific community has not agreed on a common definition of causality.

The proven scientific way to make claims about cause and effect is to perform what is called a Randomized Controlled Trial (RCT). In an RCT, a statistically representative portion of the population participating in the experiment (trial) is exposed to a treatment (action), which can be either active - the treatment is applied - or neutral (control) - the patient receives a placebo, or the patient (unit) is not treated at all.

This terminology comes from medicine, since medical trials are the field in which RCTs are applied most often. However, it is not the only industry in which conclusions can be drawn from a trial: the methodology is widely used in social studies, and it can also support decisions such as buying, selling, or holding a particular stock, or, in the advertising industry, choosing the advertisement that generates more sales than the alternatives.

Although Randomized Controlled Trials are the best way to detect causal effects, performing such scientifically rigorous experiments is most of the time either impossible or unethical. One example is trying to detect whether driving under the effects of alcohol affects the driver's skills. Another clear example is determining the causes of smoking in teenagers or young people: performing an RCT would involve taking two groups of non-smoking teenagers, making half of the units smoke for several years - depending on the experiment or the research question - and then determining whether smoking caused any effect after that period. As the reader can infer, there are clear ethical problems associated with performing full RCTs to determine causes and effects.

It is also important to make clear that there are cases in which previously collected data must be analyzed even though the RCT method could not be applied. These cases involve what is known as observational data: information obtained from situations in which a formal randomized trial has not been run, but from which it is still important to try to determine causes and effects. This is the situation of most organizations today: they may have been collecting massive amounts of data during the last decades, but they could not, or would not, establish an RCT process while collecting the metrics. Moreover, sometimes the data collection happens entirely outside a randomized controlled experiment. For example, a relatively uncommon disease that affects only a small percentage of the population might appear across such a wide range of people that the inference process becomes difficult.

In the past, causal inference methods were a statisticians-only field. However, with recent advances in machine learning algorithms, more computer scientists and machine learning engineers have been trying to infer causes and relationships through traditional and new machine learning techniques.

The ability of some machine learning algorithms to learn complex non-linear relationships has been used to detect and predict policies in which, given the particular features of an individual (patient), the algorithm determines whether or not to apply the treatment (action). This concept is known as Individual Treatment Effect estimation; it can also be referred to as Policy Risk when the prediction is a binary class - to apply or not to apply the treatment - obtained through the discovery of a certain threshold.

Predicting the individual (customized) treatment effect is of particular interest because it would lead to better decisions (actions or treatments), specifically shaped for each person, instead of relying only on the average over the whole studied population.

The ultimate motivation of this work is to predict the Individual Treatment Effects for new patients from previously collected data, using machine learning techniques that are alternatives to the ones used in past research efforts. Moreover, this work attempts to compile a literature review that can be understood by computer scientists, and the running code of all the experiments performed will be released for others to build upon.

1.2 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE), and the Precision in Estimation of Heterogeneous Effect (PEHE) on a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program): a particularly unbalanced semi-synthetic dataset created for the task of causal inference on observational data (Hill, 2011).

The research question is whether alternative machine learning methods - without adding extra complexity, custom loss functions, or custom metric functions during learning and prediction - can obtain results similar to, or even better than, the state-of-the-art metrics reported on the exact same benchmark dataset.


1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes Framework (Rubin, Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation are used throughout this thesis. This model is also known as the Neyman-Rubin causal model, and it is an approach to the statistical analysis of cause and effect based on potential outcomes.

For the machine learning experiments, the latest version of the scikit-learn Python framework (scikit-learn 0.19.2 user guide) has been used, with its underlying methods and default hyper-parameters. The mathematical notation of its documentation is also used to describe each algorithm's functions and limitations.

The different algorithms have been tested on a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by Hill (2011) as a semi-synthetic dataset based on real features obtained from an observational study (Gross, 1993). Replications of this dataset (10, 100, and 1000) have been created in order to train the machine learning models, predict on them, and obtain the desired error metrics afterwards.

It is important to note that the testing method and the metrics used to determine the effectiveness of each algorithm are different from the ones normally used to evaluate machine learning algorithms; testing within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset is used, the real Individual Treatment Effect is available for computing test metrics. Therefore, the experimental results consist of the performance of each algorithm in terms of the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE), and the Precision in Estimation of Heterogeneous Effect (PEHE). These three metrics are displayed in the Experiments chapter for each algorithm trained; a detailed explanation of their formulas can be found in Section 2.0.3.

It is also important to note that the machine learning algorithms are trained using only the applied (observed) treatment, the features (covariates in the causal inference literature), and the observed outcome (usually known as Y or Y factual). After training, a completely unseen dataset is used for testing purposes.

The trained algorithm then predicts, from the unit's (patient's) features (covariates), the outcome as if the unit had taken the treatment, and likewise the outcome as if the unit had taken the control. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE, and PEHE metrics are calculated. In addition, an average score and its deviation are reported over the 10, 100, and 1000 replications of IHDP, in order to evaluate these errors in bigger simulated scenarios.
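This train-then-predict-both-arms procedure can be sketched with scikit-learn as follows. The data below is simulated purely for illustration (names, sizes, and the true effect of 2 are assumptions, standing in for one IHDP replication), and Ridge is just one of the regressors used in the experiments:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy stand-in for one IHDP replication (all names and sizes illustrative):
# covariates X, binary treatment t, and the observed (factual) outcome y_f.
n, d = 500, 25
X = rng.normal(size=(n, d))
t = rng.integers(0, 2, size=n)
y_f = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)  # simulated true effect = 2

# Train on the covariates plus the applied treatment, targeting the factual outcome.
model = Ridge().fit(np.column_stack([X, t]), y_f)

# For unseen units, predict the outcome under both treatment arms...
X_new = rng.normal(size=(100, d))
y1_hat = model.predict(np.column_stack([X_new, np.ones(100)]))
y0_hat = model.predict(np.column_stack([X_new, np.zeros(100)]))

# ...and take their difference as the estimated individual treatment effect.
ite_hat = y1_hat - y0_hat
ate_hat = ite_hat.mean()
```

Each replication is processed this way, and the averages and deviations of the resulting error metrics over the 10, 100, and 1000 replications are then reported.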

Mathematical notation is kept to a minimum so as not to confuse the reader with unnecessary information.


1.4 Scope and Limitation

In this document, the outcome of the treatment applied to a patient is analyzed with respect to only two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments or in the developed code, although both can easily be extended to cover such cases.

A binary treatment is applied, but its outcome value is continuous. All the experiments and the developed code can also be applied to discrete outputs, although other machine learning techniques (classification algorithms) could be more suitable for that type of prediction. There are also cases in which the task is to predict whether or not to apply the treatment to an individual (also known as unit or patient). Prediction then becomes more of a classification task, in which a threshold on the confidence of affirmatively applying the treatment is set and validated by trial and error against several continuous values, to determine which one predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

This case is closer to real-world scenarios, where data has been observed and, finally, a decision on whether or not to apply the treatment (action) has to be made in order to pursue a desired result.

In this work, the case in which the dataset contains binary outcomes for predicting whether or not to apply the treatment is not covered.


Chapter 2

Background

2.0.1 Rubin-Neyman Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Neyman-Rubin potential outcomes framework, is an extended statistical analysis framework for modelling observational data, developed by Donald Rubin. Rubin built it on top of the original method that Jerzy Neyman introduced in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Neyman-Rubin potential outcomes framework considers units

$x_i \in X$

with an effectively applied treatment

$t_i \in \{0, 1\}.$

The two possible potential outcomes are defined by

$Y_0(x_i), Y_1(x_i) \in Y.$

Of one of them (the one which actually happened) we can observe the factual outcome

$y_i^F = t_i Y_1(x_i) + (1 - t_i) Y_0(x_i).$

Let $(x_1, t_1, y_1^F), \ldots, (x_n, t_n, y_n^F)$ be a sample from the factual distribution, and consequently let $(x_1, 1 - t_1, y_1^{CF}), \ldots, (x_n, 1 - t_n, y_n^{CF})$ be the counterfactual sample.

Notice that all the factual outcomes $y^F$ are known, whereas the counterfactual outcomes $y^{CF}$ are never observed for any unit (except during the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions $y^F$ and $y_{ft}$ are used interchangeably to refer to factual (observed) outcomes, while $y^{CF}$ and $y_{cft}$ refer to counterfactual outcomes.
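Since the thesis works with a semi-synthetic dataset, both potential outcomes are known at evaluation time, so the factual/counterfactual construction can be reproduced directly. A minimal numpy sketch, with all values simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative potential outcomes for n units; both are known here only
# because the data is simulated, mirroring the semi-synthetic IHDP setting.
n = 8
y0 = rng.normal(size=n)         # Y_0(x_i): outcome under control
y1 = y0 + 1.5                   # Y_1(x_i): outcome under treatment
t = rng.integers(0, 2, size=n)  # the treatment actually applied

# y_i^F = t_i * Y_1(x_i) + (1 - t_i) * Y_0(x_i): the outcome we get to observe.
y_factual = t * y1 + (1 - t) * y0
# Swapping the treatment indicator gives the (normally unobservable) counterfactual.
y_counterfactual = (1 - t) * y1 + t * y0
```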


2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit $x$ assigned either treatment $t = 1$ or $t = 0$, it is impossible to observe the counterfactual outcome $E[Y_0 \mid x, t = 1]$ or $E[Y_1 \mid x, t = 0]$ (what would have happened, i.e. what the outcome would have been, had the other treatment been given to unit $x$).

However, it is always possible to observe the outcome of the effectively applied treatment $t$, which is represented as $E[Y_0 \mid x, t = 0]$ or $E[Y_1 \mid x, t = 1]$, or, in shorter terms, $Y_0$ or $Y_1$.

In this dissertation, the focus is on the case where the causal graph is simple and known to be of the form $(Y_1, Y_0) \leftarrow x \rightarrow t$, with no hidden confounders.

This problem, like most of the problems and applications discussed under the Neyman-Rubin potential outcomes framework in this dissertation, can be extended to multi-treatment experiments. It is important to note that the problem of not having access to the counterfactual outcome $y^{CF}$ becomes even worse in the multi-treatment setting, since the number of missing values relevant to better Individual Treatment Effect estimates grows with the number of possible treatments other than the one applied.

2.0.3 Metrics for Causality

Three well-known metrics from the causality field are reported for each machine learning technique applied.

The reported losses are:

bull $\epsilon_{ITE}$: Error of the Individual Treatment Effect, also known as the Conditional Average Treatment Effect (CATE); it measures how well or how badly the treatment performs on one particular patient:

$ITE(x) := E[y \mid X = x, t = 1] - E[y \mid X = x, t = 0] = E[Y_{x1} - Y_{x0}]$

bull $\epsilon_{ATE}$: Error of the Average Treatment Effect; as its name indicates, it represents the effect that the applied treatment, either $t = 0$ or $t = 1$, had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, whose unique characteristics as a unit might lead to wrong results or no results at all:

$ATE := E[ITE(x)] = E[\delta] = E[Y_1 - Y_0], \forall x \in X$

bull $\epsilon_{PEHE}$: the Precision in Estimation of Heterogeneous Effect measures the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. Note that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or not as accurate, for the other:

$PEHE := \frac{1}{N} \sum_{i=1}^{N} \left( (y_{i1} - y_{i0}) - (\hat{y}_{i1} - \hat{y}_{i0}) \right)^2$
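When the true potential outcomes are available, as in a semi-synthetic dataset, these quantities are straightforward to compute. A sketch in numpy; note that the exact form of the reported $\epsilon_{ITE}$ varies across papers, and the mean absolute ITE error used here is an assumption for illustration:

```python
import numpy as np

def causal_errors(y1_true, y0_true, y1_hat, y0_hat):
    """Return (eps_ITE, eps_ATE, eps_PEHE) from true and predicted potential
    outcomes; the true ones are available only when the data is (semi-)synthetic."""
    ite_true = y1_true - y0_true
    ite_hat = y1_hat - y0_hat
    eps_ite = np.mean(np.abs(ite_true - ite_hat))           # assumed: mean absolute ITE error
    eps_ate = np.abs(np.mean(ite_true) - np.mean(ite_hat))  # error on the population average
    eps_pehe = np.mean((ite_true - ite_hat) ** 2)           # mean squared ITE error, as above
    return eps_ite, eps_ate, eps_pehe

# Perfect predictions drive all three errors to zero.
y1 = np.array([3.0, 2.0, 5.0])
y0 = np.array([1.0, 2.0, 1.0])
errs = causal_errors(y1, y0, y1, y0)
```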


2.0.4 Assumptions

To work on the results, three important assumptions under the Neyman-Rubin causal framework shall be made:

bull Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if $t = 0$, then $y = Y_0$ is the observed (factual) outcome $y^F$; if the applied treatment was $t = 1$, then $y = Y_1$ is the observed factual outcome $y^F$.

bull Strong Ignorability: also known as no unmeasured confounders, this assumption is stated as $(Y_1, Y_0) \perp t \mid x$ and $0 < p(t = 1 \mid x) < 1$ for all $x$. Note that, in order to assert this assumption, a domain knowledge expert has to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

bull Common Support: this assumption states that for each unit $x \in X$ there is a positive probability of being both treated ($t = 1$) and untreated ($t = 0$):

$0 < P(t = 1 \mid x) < 1$
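In practice, the common support assumption can be checked empirically by estimating propensity scores $p(t = 1 \mid x)$ and inspecting their range. A sketch on simulated data (the data-generating process and the logistic model are illustrative assumptions, not the thesis pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Simulated, confounded assignment: treatment probability depends on x_0.
X = rng.normal(size=(1000, 5))
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))

# Estimated propensity scores p(t = 1 | x); common support requires them to
# stay strictly inside (0, 1) for every unit.
ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
# Units with scores near 0 or 1 flag regions of weak overlap.
```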

2.0.5 Definitions

In causal inference from observational data, several terms are used interchangeably and might confuse the reader.

The terminology in this subsection should be clear before going further into this dissertation.

Some common synonyms are:

bull unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input. Notation: $x_i \in X$.

bull covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML). Notation: $x \in X$.

bull treatment: the possible actions that can be applied to a unit. Usually binary, but it can be multi-valued under the Neyman-Rubin potential outcomes framework. Synonym: action. Notation: $t \in \{0, 1\}$ or $t \in \{0, \ldots, N\}$.

bull outcome: the measured result of applying a treatment $t$ to a unit $x$. Synonyms: observed outcome, result, factual, Y factual. Notation: $y_f = y^F$.

bull counterfactual: what the result would have been if the opposite of the effectively applied treatment had been applied to the unit. Synonym: unobserved outcome. Notation: $y_{cft}$, $y_{cf}$, $Y^{CF}$.

2.0.6 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin, 1978).


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Athey and Imbens; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology, and sociology (Rubin; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning it has caught attention no less than a decade ago (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted the discovery of the underlying causal graph from collected data (Hoyer et al.; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods for causal inference model the relationships between features, actions and rewards with one or more parameters, trying to explicitly model the relations between context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice 1976; Gelman and Hill 2007), random forests (Wager and Athey 2015) and regression trees (Chipman, George and McCulloch 2010) have been used in the past to complete the task. For example, (Wager and Athey 2017) estimate ITEs with causal forests, but their asymptotic estimates have limitations in datasets with a large number of relevant features that need to be addressed in future work.

In non-parametric approaches, counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan 2016; Austin 2011; Rosenbaum and Rubin 1983; Rosenbaum 2002). Doubly robust causality is performed by merging parametric and non-parametric methods (Dudik, Langford and Li 2011; Jiang and Li 2015).

Doubly robust methods are known for merging the characteristics of both families; a common example is propensity score weighted regression (Bang and Robins 2005; Dudik, Langford and Li 2011). When the treatment assignment probability is known, this approach models the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most cases with observational data, its efficiency drops dramatically (Kang and Schafer 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions, as well as the application of other techniques to causality, with special focus on datasets with unbalanced treatment application. This refers to the sub-area known as causal inference from observational data. Observational data is data that has been, or is, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting


advances to the scientific community (Shalit, Johansson and Sontag 2017; "Learning Representations for Counterfactual Inference"). (Tian et al 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag 2016a; Shalit, Johansson and Sontag 2017; Alaa, Weisz and Van Der Schaar 2017) have made important contributions, whereas for Policy Optimization the work of (Swaminathan and Joachims 2015a; Swaminathan and Joachims 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few: (Wager and Athey 2015), (Athey and Imbens 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag 2016) and (Shalit, Johansson and Sontag 2017). (Johansson, Shalit and Sontag 2016a) and (Shalit, Johansson and Sontag 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag 2017) the authors built on (Johansson, Shalit and Sontag 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work was performed with Gaussian processes (Alaa, Weisz and Van Der Schaar 2017) and decision trees in different approaches (Hill 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey 2015).

Similarly, (Atan et al 2016) face the problem of learning from biased data with many features, performing feature selection while predicting among multiple possible actions (outcomes); this is more challenging but closer to actual industry problems. The authors also remark on the difficulty of learning features that are relevant for predicting some actions but not others. The relevant feature selection was learned by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), likewise in (Tekin and Van Der Schaar 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and do not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims 2015a) came up with a Counterfactual Risk Minimization (CRM) method in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al 2007; Blitzer, McDonald and Pereira 2006). Additional techniques for policy optimization were developed by (Beygelzimer and Langford 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame and Van Der Schaar 2018) introduce is based on domain adaptation (DA), as in (Gan et al 2016). More work in the DA techniques field was done by (Zhang et al 2013; Daumé 2009).


To conclude, in cause and effect analysis Time Series data is widely adopted for decision making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al 2012), algorithms can learn rules to make decisions along time-steps. An estimator on structural nested models was introduced by (Lok 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time series causality task. Later in time, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that was set beforehand (Dudik, Langford and Li 2011; Swaminathan and Joachims 2015a; Jiang and Li 2015; Paduraru et al 2012; Doroudi, Thomas and Brunskill 2017).

2.1 Machine Learning

This section describes the machine learning techniques, applied with the scikit-learn open source framework, that were used to perform the experiments.

The vast majority of the methods tested belong to the family of Generalized Linear Models, in which the target or label value is represented as a linear combination of the covariates (inputs):

y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p        (2.1)

where the vector w = (w_1, ..., w_p) represents the coefficients and w_0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it.

Mathematically it solves the problem of

min_w ||Xw − y||_2^2

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the model produces high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.
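As a minimal sketch of how this estimator is used (the data here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                      # covariates
y = X @ np.array([1.0, 2.0, -1.0]) + 0.5  # noiseless linear target

# Ordinary Least Squares: minimizes ||Xw - y||_2^2
ols = LinearRegression().fit(X, y)
print(ols.coef_, ols.intercept_)  # recovers the true weights and intercept
```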


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the Linear Regression method mentioned above by penalizing the size of the coefficients. The loss turns into a penalized sum-of-squares minimization problem:

min_w ||Xw − y||_2^2 + α ||w||_2^2

It is worth mentioning that the parameter α ≥ 0 controls the amount of shrinkage, and therefore the robustness to collinearity, of the trained model.
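A small sketch illustrating the effect of α on near-collinear data (synthetic, illustrative values):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
A = rng.rand(50, 2)
# Third column nearly duplicates the first -> approximate linear dependence
X = np.hstack([A, A[:, :1] + 1e-3 * rng.randn(50, 1)])
y = A[:, 0] + A[:, 1]

# Larger alpha -> stronger shrinkage of the coefficients
small = Ridge(alpha=0.01).fit(X, y)
large = Ridge(alpha=100.0).fit(X, y)
print(np.abs(large.coef_).sum() < np.abs(small.coef_).sum())  # True
```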

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely spread Support Vector Machines for classification to regression problems. During the training phase, the best possible solution is the one penalized least in total by the loss function. The support vectors are the training inputs that either violate the margin, lie within it, or sit on the edge of the generated hyper-plane used for future predictions.

In particular, an SVR takes training vectors x_i ∈ R^p, i = 1, ..., n, and a vector y ∈ R^n; ε-SVR solves the following primal problem:

    min_{w, b, ζ, ζ*}  (1/2) w^T w + C Σ_{i=1}^{n} (ζ_i + ζ_i*)

    subject to  y_i − w^T φ(x_i) − b ≤ ε + ζ_i,
                w^T φ(x_i) + b − y_i ≤ ε + ζ_i*,
                ζ_i, ζ_i* ≥ 0,  i = 1, ..., n,

where C > 0 is the regularization upper bound and K(x_i, x_j) ≡ φ(x_i)^T φ(x_j) is the kernel. Training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is:

    Σ_{i=1}^{n} (α_i − α_i*) K(x_i, x) + ρ
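A minimal usage sketch with scikit-learn's SVR on a synthetic one-dimensional problem (the parameter values here are illustrative, not the ones selected for the IHDP experiments):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel()  # smooth noiseless target

# RBF-kernel epsilon-SVR; C bounds the dual coefficients, gamma controls
# the kernel width, epsilon the width of the insensitive tube.
svr = SVR(kernel="rbf", C=1e3, gamma=0.5, epsilon=0.01).fit(X, y)
pred = svr.predict(np.array([[0.0], [1.5]]))
print(pred)  # approximately sin(0) and sin(1.5)
```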

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than plain Linear Regression.

This technique builds a probabilistic model of the regression problem, where the prior for the parameter w of the general Bayesian Regression solver is a spherical Gaussian:

p(w | λ) = N(w | 0, λ^{-1} I_p)

The scikit-learn defaults are used to train the model: α_1 = α_2 = λ_1 = λ_2 = 10^{-6}.

During the model fitting process, the parameters w, α and λ are estimated jointly.
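A short sketch (synthetic data) showing that the precisions α and λ are estimated during the fit, and that the model also returns predictive uncertainties:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.randn(100)

# alpha_ (noise precision) and lambda_ (weight precision) are estimated
# jointly with the coefficients; the hyper-priors default to 1e-6.
br = BayesianRidge().fit(X, y)
mean, std = br.predict(X[:5], return_std=True)  # predictive mean and std
print(br.alpha_, br.lambda_)
```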


2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

min_w  (1 / (2 n_samples)) ||Xw − y||_2^2 + α ||w||_1

This method minimizes the least-squares loss with α||w||_1 added, where α is a constant and ||w||_1 is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm yields sparse models, which may be helpful for feature selection.
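The sparsity can be seen directly in a small synthetic example where only two of ten features carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1]  # only the first two features matter

lasso = Lasso(alpha=0.1).fit(X, y)
# The l1 penalty drives the coefficients of irrelevant features to zero:
print(np.flatnonzero(lasso.coef_))  # typically just the informative features
```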

2.1.6 Lasso Lars

This model is a Lasso fitted with the Least Angle Regression (LARS) algorithm, applying ℓ1 regularization.

The objective function is determined by

(1 / (2 n_samples)) ||y − Xw||_2^2 + α ||w||_1

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w.

It drops the assumption that the Gaussian is spherical, allowing it to be elliptical.

Mathematically: p(w | λ) = N(w | 0, A^{-1})

with diag(A) = λ = {λ_1, ..., λ_p}.

2.1.8 Passive Aggressive Regressor

Passive Aggressive algorithms are suitable for large-scale learning. They do not require a learning rate, but they do require a regularization parameter C.

The regressor can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (squared epsilon-insensitive).

2.1.9 Theil Sen Regressor

It is especially suited to data with multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems; in high dimensions this method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the n nearest neighbors found during the training phase. It is important to notice that n is defined by the user and will positively or negatively affect the quality of the predictions.
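A toy sketch showing how the choice of n changes the prediction:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

# n_neighbors is user-defined and strongly affects the predictions:
knn1 = KNeighborsRegressor(n_neighbors=1).fit(X, y)
knn3 = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(knn1.predict([[0.1]]))  # [0.0] -> nearest point only
print(knn3.predict([[0.1]]))  # [1.0] -> mean of y at x = 0, 1, 2
```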

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can produce predictions for more than one class using the multinomial (log-linear) formulation.

The scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional ℓ2 or ℓ1 regularization.

Several solvers and regularizations were applied to the datasets, and the results are discussed in the Experiments chapter.
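A brief configuration sketch (iris is just a stand-in dataset; the solvers match the ones applied in the experiments):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Multi-class logistic regression with L2 penalty; the two solvers used
# in the experiments were newton-cg and lbfgs.
for solver in ("newton-cg", "lbfgs"):
    clf = LogisticRegression(solver=solver, penalty="l2",
                             max_iter=500).fit(X, y)
    print(solver, clf.score(X, y))
```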


Chapter 3

Methodology

This chapter details what this work set out to achieve. In addition, the methods that were used are explained, along with any other information needed to help the reader follow the flow of the experiments covered later.

It also presents the dataset that was used, closing with a section about other possible datasets and the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in full detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real lifescenarios are difficult to obtain

On the one hand, the whole point of the current project, and to some extent of the latest efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients or inputs (in the machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Since the already collected, observational data has not been randomized properly, the treated and control groups do not necessarily come from the same probability distribution, and the number of units that received the treatment could differ substantially from the number of units that did not.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions, since they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or the pedestrians, tested against a control treatment of driving without alcohol consumption, is completely unethical to perform, for clear reasons.

To solve these limitations when working on causal effects from observational data, synthetic, semi-synthetic or toy datasets are created by researchers in order to establish a good starting point and benchmark framework to try, test or develop better algorithms able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions in causal inference. One, the most common, is the setting in which the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases, a policy risk function π is designed that applies the treatment t or not depending on a certain threshold θ. Minimizing the errors made when predicting whether to apply the active treatment or the control is the main goal when iterating over different values of the threshold on the trained dataset.
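The thresholding described above can be sketched as follows (the function name and the ITE values are hypothetical, for illustration only):

```python
import numpy as np

def policy(ite_pred, theta=0.0):
    """Treat (t=1) every unit whose predicted ITE exceeds the threshold theta."""
    return (ite_pred > theta).astype(int)

# Hypothetical predicted ITEs for five units:
ite_pred = np.array([1.5, -0.2, 0.05, 2.0, -1.0])
print(policy(ite_pred, theta=0.0))  # [1 0 1 1 0]
print(policy(ite_pred, theta=0.1))  # a stricter threshold treats fewer units
```

Sweeping θ and measuring the resulting error rate is what the policy risk evaluation iterates over.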

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment interventions to reduce the developmental and health problems caused by the low birth weight of premature infants. On the one hand, the treated group received home visits and integration at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

(Hill 2011) presented a semi-synthetic (also referred to in this work and in the field as semi-simulated) dataset derived directly from the original IHDP RCT (Gross 1993) mentioned in the paragraph above. In (Hill 2011), some continuous and binary covariates from this real-life RCT were selected; making use of these covariates, the author generated non-parametric simulated outcomes for the whole population of the trial. In total, 25 covariates of the whole study were taken for this dataset creation. Consequently, the author introduced an artificial imbalance between the control and treatment individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units, or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of treatment t = 0 or t = 1 given the generalization task an algorithm has to perform.

Along with the covariates, the simulated causal information can be observed for each unit: the effectively applied treatment (t = 0 or t = 1), the observed outcome (y_F), the counterfactual outcome (y_CF), and the average outcomes with noise, mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting B in the NPCI package (Dorie 2016). The 100 and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson and Sontag 2017; Louizos et al 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles that evaluate ITE, ATE and PEHE errors, should the reader wish to look further into them. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill 2011), but with slightly different replication methods, a different number of runs, or without specifying how many replications were used.

In (Johansson, Shalit and Sontag 2016b), the IHDP dataset (Hill 2011) is run on 100 replications to perform hyperparameter tuning and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation, since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al 2017; Shalit, Johansson and Sontag 2017). For the BART results in (Johansson, Shalit and Sontag 2016b), the authors relied on Bayesian Additive Regression Trees (Chipman, George and McCulloch 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill 2011). However, the authors neither make explicit the number of replications used to gather the metrics, nor state whether log-linear response surface A, B, or any other method was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with the results of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset was performed to ultimately predict the factual (y_F) and counterfactual (y_CF) outcomes for every single unit. Subsequently, those values are input into the programming code produced by (Louizos et al 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome y_CF nor the noisy average outcomes mu0 and mu1 can be used at all to train the regression models. Instead, these values, along with the factual outcome y_F, are only used to obtain the εITE, εATE and εPEHE errors.
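Assuming the usual definitions from the cited papers, the ATE and PEHE errors can be sketched as follows (the helper name and toy values are hypothetical; the actual evaluation uses the code by Louizos et al.):

```python
import numpy as np

def eval_errors(ite_pred, mu1, mu0):
    """ATE and PEHE errors given predicted effects and noiseless outcomes.

    mu1, mu0 are the noiseless outcomes under treatment and control, so the
    true individual effect is mu1 - mu0 (available only in synthetic data).
    """
    ite_true = mu1 - mu0
    eps_ate = np.abs(ite_pred.mean() - ite_true.mean())
    eps_pehe = np.sqrt(np.mean((ite_pred - ite_true) ** 2))
    return eps_ate, eps_pehe

# Toy example with three units:
mu1 = np.array([3.0, 2.0, 4.0])
mu0 = np.array([1.0, 1.0, 1.0])
ite_pred = np.array([2.0, 1.5, 2.5])
print(eval_errors(ite_pred, mu1, mu0))
```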

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The 10, 100 and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the same ones used to produce the results in (Shalit, Johansson and Sontag 2017; Louizos et al 2017), whose tables with state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson and Sontag 2017) in their publication, the same convention later followed by (Louizos et al 2017) to perform, compare and show their results.

Within-sample: this test refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples receiving treatments t = 0 and t = 1), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that in practice the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All these are common problems of observational data and were mentioned in the Background chapter.


Out-of-sample: these predictions are made on completely unseen units, outside the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even more from those of the training phase (already potentially unbalanced). The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the test dataset to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using treatment value t = 0, and subsequently predictions are made setting all treatment values to t = 1. The difference between these two predictions for each input is known as the ITE and will ultimately define whether the patient would benefit or not from the treatment. Mathematically it is represented by E[Y1 − Y0 | x].

The machine learning algorithms, implemented in Python following (User guide: contents — scikit-learn 0.19.2 documentation), were run with default hyperparameters to obtain the above mentioned metrics, which are shown in the tables displayed in this chapter. Hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
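The two-prediction procedure can be sketched on a simulated dataset (illustrative only; the real experiments use the IHDP replications and the regressors described in Chapter 2):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n = 500
X = rng.rand(n, 5)                          # covariates
t = rng.binomial(1, 0.3, size=n)            # observed treatment assignment
y = X[:, 0] + 2.0 * t + 0.1 * rng.randn(n)  # true treatment effect = 2

# Train on the factual data, with the treatment as an extra input feature:
model = LinearRegression().fit(np.column_stack([X, t]), y)

# Predict both potential outcomes for every unit and subtract:
y0 = model.predict(np.column_stack([X, np.zeros(n)]))
y1 = model.predict(np.column_stack([X, np.ones(n)]))
ite = y1 - y0                               # per-unit estimate of E[Y1 - Y0 | x]
print(ite.mean())                           # close to the true effect, 2.0
```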

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower, the better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                    εITE           εATE           √εPEHE
Support Vector Regressor (SVR)      2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                       3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                           4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                               4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                       3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor          4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                   3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                    5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor                5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                    3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show the within-sample and out-of-sample errors, respectively, for 100 replications of the IHDP dataset. In this case it is remarkable that the split between training and testing was performed randomly, over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                    εITE           εATE           √εPEHE
Support Vector Regressor (SVR)      2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                       3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                           4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                               4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                       3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor          4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                   3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                    4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor                4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                    3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained in the following experiments with the datasets already split into train and test, downloaded from (Johansson 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                    εITE           εATE           √εPEHE
Support Vector Regressor (SVR)      3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                       4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                           4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                               4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                       4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor          5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                   4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                    5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor                5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                    4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                    εITE           εATE           √εPEHE
Support Vector Regressor (SVR)      2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                       4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                           4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                               4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                       4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor          5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                   4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                    4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor                4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                    4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al 2017; Shalit, Johansson and Sontag 2017). As mentioned in this thesis, hyperparameter tuning, where performed, was done on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

                                    εITE           εATE           √εPEHE
Support Vector Regressor (SVR)      3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                       4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                           4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                               4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                       4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor          5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                   4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                    5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor                5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                    4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

                                    εITE           εATE           √εPEHE
Support Vector Regressor (SVR)      2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                       4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                           4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                               4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                       4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor          4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                   4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                    4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor                4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                    4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both the within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson and Sontag 2017; Louizos et al 2017). The same semi-synthetic IHDP dataset by (Hill 2011), with log-linear response setting A generated using the code from (Dorie 2016), was used to perform both types of measures.

Four different tables are shown for the 1000 replications, in pairs: two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates), while the other pair was obtained by scaling the features to [0, 1] with the MinMaxScaler from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling in the final results presented in the following section.



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                    εITE           εATE           √εPEHE
Support Vector Regressor (SVR)      3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                       4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                           4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                               4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                       4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor          5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                   4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                    5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor                5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                    4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                    εITE           εATE           √εPEHE
Support Vector Regressor (SVR)      2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                       4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                           4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                               4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                       4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor          5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                   4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                    4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor                4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                    4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                    εITE           εATE           √εPEHE
Support Vector Regressor (SVR)      2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                       4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                           4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                               4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                       4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor          5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                   4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                    4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor                4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                     4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                      4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                          4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                    4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

24 Chapter 4 Experiments

TABLE 4.10: IHDP 1000 replications - No Scaling - Out-of-sample

Method                          εITE          εATE          √εPEHE
Support Vector Regressor (SVR)  2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                   4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                       4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                           4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                   4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor      5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor               4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighboursRegressor            4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                 4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                  4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                      4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Consequently, Logistic Regression with a multinomial multi-class setting has also been applied. Its performance is well below that of the regressors, mainly because the continuous target values must be encoded into discrete classes before probabilities can be assigned, and those classes are not the actual values that need to be predicted; further precision is lost when decoding the predictions. The l2 penalty was used with two different solvers, newton-cg and lbfgs. The results are displayed in Table 4.11 and Table 4.12.
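The precision loss described above can be illustrated with a minimal, self-contained sketch (not the thesis code): a continuous outcome must be quantised into classes before a classifier can handle it, and decoding a predicted class only recovers the bin centre.

```python
# Minimal stdlib illustration (not the thesis code) of the precision lost
# when a continuous outcome is pushed through a classifier: the target is
# binned into discrete classes, and decoding a predicted class can only
# recover the bin centre, never the original value.
def encode(y, lo, hi, n_bins):
    """Map a continuous outcome to a class index in [0, n_bins)."""
    width = (hi - lo) / n_bins
    return min(n_bins - 1, max(0, int((y - lo) / width)))

def decode(k, lo, hi, n_bins):
    """Map a class index back to the centre of its bin."""
    width = (hi - lo) / n_bins
    return lo + (k + 0.5) * width

y_true = 3.14159
k = encode(y_true, 0.0, 10.0, 20)     # 20 bins of width 0.5 -> class 6
y_back = decode(k, 0.0, 10.0, 20)     # 3.25, not 3.14159
error = abs(y_true - y_back)          # irreducible quantisation error
```

Even a classifier that predicts the correct class for every unit would keep this quantisation error, which is one reason the logistic models in Tables 4.11 and 4.12 trail the regressors.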

TABLE 4.11: IHDP 100 replications, logistic regressions - Within sample

Method                               εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)  7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)      7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                               εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)  5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)      5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few rounds of hyper-parameter tuning were run. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel with C=1e3 and gamma=0.01. The selection was performed both within-sample and out-of-sample, but on 100 replications of the dataset; this is the same procedure the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyper-parameter search.
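A gamma sweep of the kind reported in these tables can be sketched as follows. The data is synthetic and the error is plain held-out MSE rather than the causal metrics, so this only illustrates the selection mechanics, not the thesis's actual runs.

```python
# Hedged sketch of a gamma sweep for an RBF-kernel SVR (synthetic data,
# plain MSE instead of the causal error metrics): fit one model per gamma
# at C=1e3 and keep the setting with the lowest held-out error.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(300, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

scores = {}
for gamma in [0.1, 0.05, 0.01, 0.001]:
    model = SVR(kernel="rbf", C=1e3, gamma=gamma).fit(X_tr, y_tr)
    scores[gamma] = mean_squared_error(y_te, model.predict(X_te))

best_gamma = min(scores, key=scores.get)  # lowest held-out MSE wins
```

Note that in Tables 4.13 and 4.14 the C values above 1e3 leave the rbf results unchanged, so the sweep effectively reduces to choosing gamma.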

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within sample

Method                 εITE          εATE          √εPEHE
SVR-rbf-1e3-g01        3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g005       2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g001       2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0001      3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g00001     4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g000001    4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g01       3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g01       3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g01       3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2   2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1   2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4   2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2  2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

Method                 εITE          εATE          √εPEHE
SVR-rbf-1e3-g01        2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g005       2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g001       2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0001      3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g00001     4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g000001    4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g01       2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g01       2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g01       2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2   2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1   2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4   2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2  3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39

Finally, the results obtained by this thesis's experiments are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results reported in the publication by Shalit, Johansson, and Sontag (2017).


TABLE 4.15: ICML 2017 - "Estimating individual treatment effect: generalization bounds and algorithms" (Shalit, Johansson and Sontag, 2017)

Method     √εPEHE     εATE
OLS/LR-1   5.8 ± .3   .73 ± .04
OLS/LR-2   2.4 ± .1   .14 ± .01
BLR        5.8 ± .3   .72 ± .04
k-NN       2.1 ± .1   .14 ± .01
TMLE       5.0 ± .2   .30 ± .01
BART       2.1 ± .1   .23 ± .01
RANDFOR    4.2 ± .2   .73 ± .05
CAUSFOR    3.8 ± .2   .18 ± .01
BNN        2.2 ± .1   .37 ± .03
TARNET     .88 ± .0   .26 ± .01
CFR MMD    .73 ± .0   .30 ± .01
CFR WASS   .71 ± .0   .25 ± .01

Within-sample, IHDP 1000 replications

TABLE 4.16: ICML 2017 - "Estimating individual treatment effect: generalization bounds and algorithms" (Shalit, Johansson and Sontag, 2017)

Method     √εPEHE     εATE
OLS/LR-1   5.8 ± .3   .94 ± .06
OLS/LR-2   2.5 ± .1   .31 ± .02
BLR        5.8 ± .3   .93 ± .05
k-NN       4.1 ± .2   .79 ± .05
BART       2.3 ± .1   .34 ± .02
RANDFOR    6.6 ± .3   .96 ± .06
CAUSFOR    3.8 ± .2   .40 ± .03
BNN        2.1 ± .1   .42 ± .03
TARNET     .95 ± .0   .28 ± .01
CFR MMD    .78 ± .0   .31 ± .01
CFR WASS   .76 ± .0   .27 ± .01

Out-of-sample, IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section covered publications implementing powerful feature selection methods, an experiment was also performed in the developed code using Recursive Feature Elimination (RFE) from the scikit-learn library.

In addition, under the Strong Ignorability assumption on the studied dataset it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, it might relieve some of their errors.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; they are therefore not shown here, but can be reviewed in the code implementation for further analysis.
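A minimal sketch of such an RFE run with scikit-learn follows; the thesis's actual estimator and number of retained features are assumptions here, with synthetic data in place of IHDP.

```python
# Hedged RFE sketch with scikit-learn (the estimator and the number of
# retained features are assumptions, not the thesis configuration):
# recursively drop the weakest feature by the linear model's coefficients.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.01 * rng.normal(size=100)  # 2 informative features

selector = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X, y)
kept = [i for i, keep in enumerate(selector.support_) if keep]
```

On this toy problem the selector keeps exactly the two informative columns; on IHDP, as noted above, discarding covariates hurt the causal metrics.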

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network (DANN) implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the 10-replication results reported here; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel dual-core i7.

Due to their architectural design, Domain Adaptation algorithms are a promising avenue for exploring ITE and ATE prediction.

TABLE 4.17: Domain Adaptation Neural Networks

Method                 εITE          εATE          √εPEHE
DANN (Within-sample)   1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)   1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and Out-of-sample, IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the results of the machine learning regressors applied in this dissertation are very close to those obtained in the work published by the cited authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the treatment imbalance in the dataset) nor any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.


It appears that a considerable amount of the cited authors' effort goes into complicated methods that do not gain much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No further metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Network over 10 replications showed promising results that should be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a unit are close to those obtained by the more elaborate, custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions or special preprocessing steps (except for scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, though its emphasis shifted once the obtained metrics turned out to be nearly as good as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the deep learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and loss functions. Moreover, there are continuous-space causal problems on observational data, with more than two possible outcomes, that are substantially better suited to Reinforcement Learning algorithms than to any deep neural network or regressor.

Finally, this work is intended to fill a largely empty space of straightforward definitions for applying machine learning to causality. Although several notable papers have been published in the last two years, they are difficult to follow when relating terms from the causal inference field to researchers with a computer science or data science background. I gave my best to compile, define, explain, detail and relate causal inference to machine learning terminology.


5.2 Future work

Future directions for this work include five different approaches that should be taken.

First, it would be important to try the applied machine learning methods on other benchmark datasets and compare the results with other published papers and with algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to multi-valued treatments. To the best of my knowledge it should not be costly to make this modification in the code; however, at least one new dataset supporting this kind of treatment would need to be processed.

Third, applying this method to binary factual predictions, with a Policy Risk threshold determining whether or not a treatment should be applied, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the full 1000 replications of the IHDP dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it showed in the experiment that was run.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied frames them as time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers working on machine learning applied to treatments over time.


Bibliography

Alaa, Ahmed M., Michael Weisz and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur, et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353-60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399-424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". In: Biometrics. DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai, et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094

Bottou, Léon, et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207-3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor, et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266-298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang, et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849-866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217-240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596-1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O., et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199-1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523-539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331-339. DOI: 10.1016/B978-1-55860-377-6.50048-7

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464-1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571

Louizos, Christos, et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2

Maathuis, Marloes H., et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247-248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.nature.com/articles/nmeth0410-247

Mooij, Joris M., et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1-102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991

Nahum-Shani, Inbal, et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478-494. ISSN: 1939-1463. DOI: 10.1037/a0029373. URL: http://www.ncbi.nlm.nih.gov/pubmed/23025434

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref

Paduraru, Cosmin, et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89-101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393-1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1-17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41-55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34-58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322-331. DOI: 10.1198/016214504000001880. URL: https://doi.org/10.1198/016214504000001880

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson and David Sontag (2017). Supplemental Materials for "Estimating individual treatment effect: generalization bounds and algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction. Complete draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu, et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517-1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147-2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 17/08/2018)

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352

Zhang, Kun, et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.



Contents

Declaration of Authorship
Abstract
Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Purpose and Research Question
  1.3 Approach and Methodology
  1.4 Scope and Limitation

2 Background
  2.0.1 Rubin-Newman Causal Model
  2.0.2 The fundamental problem of causal analysis
  2.0.3 Metrics for Causality
  2.0.4 Assumptions
  2.0.5 Definitions
  2.0.6 Related Work
  2.1 Machine Learning
    2.1.1 Ordinary Least Squares (Linear Regression)
    2.1.2 Ridge Regression
    2.1.3 Support Vector Regressor
    2.1.4 Bayesian Ridge
    2.1.5 Lasso
    2.1.6 Lasso Lars
    2.1.7 ARD Regression
    2.1.8 Passive Aggressive Regressor
    2.1.9 Theil Sen Regressor
    2.1.10 K-Neighbors Regressor
    2.1.11 Logistic Regression

3 Methodology
  3.1 Dataset
  3.2 IHDP dataset
  3.3 Other articles metrics

4 Experiments
  4.1 Machine learning methods applied to IHDP dataset
  4.2 Other experiments
    4.2.1 Recursive Feature Elimination
    4.2.2 Domain Adaptation Neural Networks
  4.3 Discussion

5 Conclusions
  5.1 Concluding Remarks
  5.2 Future work

Bibliography


List of Tables

4.1  IHDP 10 replications with traditional machine learning algorithms - Within sample
4.2  IHDP 10 replications with traditional machine learning algorithms - Out-of-sample
4.3  IHDP 100 replications - Within sample
4.4  IHDP 100 replications - Out-of-sample
4.5  IHDP 100 replications, already split dataset - Within sample
4.6  IHDP 100 replications, already split dataset - Out-of-sample
4.7  IHDP 100 replications - No scaling - Within sample
4.8  IHDP 1000 replications - No Scaling - Out-of-sample
4.9  IHDP 100 replications - Scaled - Within sample
4.10 IHDP 1000 replications - No Scaling - Out-of-sample
4.11 IHDP 100 replications, logistic regressions - Within sample
4.12 IHDP 100 replications, logistic regressions - Out-of-sample
4.13 IHDP 100 replications, SVR hyper-parameter tuning - Within sample
4.14 IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample
4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)
4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)
4.17 Domain Adaptation Neural Networks

xv

List of Abbreviations

ML    Machine Learning
SVR   Support Vector Regressor
RL    Reinforcement Learning
NN    Neural Networks
LR    Linear Regression
KNN   K Nearest Neighbours
RCE   Randomized Controlled Experiment
ITE   Individual Treatment Effect
ATE   Average Treatment Effect
PEHE  Precision in Estimation of Heterogeneous Effects
CATE  Conditional Average Treatment Effect
RCM   Rubin Causal Model


Chapter 1

Introduction

1.1 Motivation

Causality is often confused with correlation, yet correlation does not imply causation. Such inferences are often called spurious correlations, and they frequently confound the process by which humans make decisions.

Even today, the scientific community has not agreed on a common definition of causality.

The proven scientific way to make claims about cause and effect is to perform what is called a Randomized Controlled Trial (RCT). In an RCT, a statistically representative portion of the population participating in the experiment (trial) is exposed to a treatment (action), which can be either active, applying the treatment, or neutral (control), giving the patient a placebo or not treating the patient (unit) at all.

This vocabulary is medical because the field where RCTs are applied most is medical trials. However, it is not the only industry where conclusions can be drawn from a trial: RCTs are widely used in social studies, and can also inform decisions such as buying, selling or holding a particular stock, or choosing which advertisement generates more sales in the advertising industry.

Although Randomized Controlled Trials are the best way to detect causal effects, performing such scientifically rigorous experiments is often either impossible or unethical. One example is trying to determine whether driving under the influence of alcohol affects a driver's skills. Another clear example is determining the causes of smoking in teenagers or young people: an RCT would involve taking two groups of non-smoking teenagers, making half of the units smoke for several years (depending on the experiment or the research question), and then determining what smoking caused after that period. As the reader can infer, there are obvious ethical problems with performing full RCTs to determine causes and effects.

It is also important to make clear that there are settings where data has already been collected without any RCT design in mind; such data is known as observational data. Observational data is defined as information obtained from previously collected situations in which a formal


randomized trial method has not been applied, but where it is still important to try to determine causes and effects from the data. This is the case in most organizations today: they have been collecting massive amounts of data over the last decades, but could not, or did not, establish an RCT process while the metrics were collected. Moreover, sometimes data collection happens outside any randomized controlled experiment. For example, a not-so-common disease affecting only a small percentage of the population may appear across a wide range of people, which makes the inference process difficult.

In the past, causal inference methods were the province of statisticians only. However, with recent advances in machine learning algorithms, more computer scientists and machine learning engineers have been trying to infer causes and relationships through traditional and new machine learning techniques.

The ability of some machine learning algorithms to learn complex non-linear relationships has been used to detect and predict policies in which, given the particular features of an individual (patient), the algorithm determines whether or not to apply the treatment (action) to them. This concept is known as Individual Treatment Effect estimation; it can also be referred to as Policy Risk when predicting a binary class, that is, whether or not to apply the treatment, through the discovery of a certain threshold.

It is of interest to be able to predict the individual (customized) treatment effect because this leads to better decisions (actions or treatments) specifically shaped for each person, instead of relying only on the average of the whole studied population.

The ultimate motivation of this work is to predict the Individual Treatment Effects for new patients from the previously collected data, using machine learning techniques alternative to the ones used in past research efforts. Moreover, this work attempts to compile a literature review that can be understood by more computer scientists, and running code for all the experiments performed will be released for others to build upon.

1.2 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE) for a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program), a semi-synthetic, particularly unbalanced dataset created for the task of causal inference on observational data (Hill, 2011).

The research experiments investigate whether alternative machine learning methods, without adding extra complexity, custom error losses or custom metric functions while learning and predicting, are able to obtain similar or even better results than the state-of-the-art metrics based on the exact same benchmark dataset.


1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes Framework (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation will be used throughout this thesis. This model is also known as the Neyman-Rubin causal model, and it is an approach to statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn (User guide: contents — scikit-learn 0.19.2 documentation) Python framework has been used. All the underlying methods and default hyper-parameters have been used. The mathematical notation of its documentation will also be presented to describe each algorithm's functions and limitations.

The different algorithms have been tested on a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill, 2011) as a semi-synthetic dataset based on real features obtained from a real observational study (Gross, 1993). Replications of this dataset were created to obtain 10, 100 and 1000 cases, in order to train and evaluate the machine learning models on them and obtain the desired metric error results afterwards.

It is important to notice that the testing method and the metrics used to determine the effectiveness of each algorithm are different from the ones normally used to test machine learning algorithms; testing these algorithms within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset has been used, the real Individual Treatment Effect is available to compute testing metrics. Therefore, the experiment results will consist of the performance of each algorithm based on the Individual Treatment Effect (ITE), Average Treatment Effect (ATE) and Precision in Estimation of Heterogeneous Effect (PEHE). These three are the metrics displayed in the Experiments chapter for each algorithm trained. A detailed explanation of these formulas can be found in section 2.0.3.

Also, it is important to notice that the machine learning algorithms are trained using just the applied (observed) treatment, the features (covariates in the causal inference literature) and the observed outcome (usually known as Y or Y factual). After training, a dataset completely unseen during training is used for testing purposes.

The already trained algorithm predicts the outcome based on the unit's (also known as patient's) features (covariates), both for the case in which the unit would have taken the treatment and, likewise, for the case in which the unit would have taken the control treatment. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE and PEHE metrics are calculated. In addition, an average score and its deviation were computed for each run over the 10, 100 and 1000 replications of the IHDP to evaluate these errors in bigger simulated scenarios.
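As a sketch, this train-on-factual, predict-both-arms procedure can be written with a single scikit-learn regressor trained on the covariates plus the treatment indicator. The data below is randomly generated stand-in data, not the IHDP files, and the choice of Ridge is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy stand-in for one replication: covariates x, treatment t,
# observed (factual) outcome yf, with a true effect of 2.0.
n, d = 200, 25
x = rng.normal(size=(n, d))
t = rng.integers(0, 2, size=n)
yf = x[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)

# Train one regressor on [covariates, treatment] -> factual outcome.
model = Ridge().fit(np.column_stack([x, t]), yf)

# Predict both potential outcomes for every unit by toggling t.
y1_hat = model.predict(np.column_stack([x, np.ones(n)]))
y0_hat = model.predict(np.column_stack([x, np.zeros(n)]))
ite_hat = y1_hat - y0_hat  # estimated Individual Treatment Effect
```

The same toggling step is what produces the predicted counterfactual outcome for units whose factual arm was observed.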

The mathematical notation will be kept to a minimum, to not confuse the reader with unnecessary information.


1.4 Scope and Limitation

In this document, the outcomes of the treatment applied to a patient will be analyzed only with respect to two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments nor in the developed code, but both can be easily extended to cover these cases.

A binary treatment is applied, but its outcome value is continuous. This differs from the most commonly used case of four possible scenarios, in which usually just two can be observed or measured. All the experiments and the code developed can be applied to discrete outputs, but other machine learning techniques could be more suitable for this type of prediction (classification algorithms). Also, there are cases in which the task is to predict whether or not to apply the treatment to an individual (also known as a unit or patient). Prediction in these cases turns into a classification task, in which a threshold on the confidence of affirmatively applying the treatment is set and validated, through trial and error against several continuous values, to determine the one that predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

This case is more similar to real-world scenarios, where the data was observed and finally a decision on applying or not the treatment (action) has to be made in order to pursue a desired result.

In this work, the cases in which the dataset contains outcomes in a binary form, to predict whether or not to apply the treatment, will not be covered.


Chapter 2

Background

2.0.1 Neyman-Rubin Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Neyman-Rubin Potential Outcomes framework, is an extended statistical analysis framework developed by Donald Rubin to model observational data. He built the mentioned framework on top of the original method that Jerzy Neyman developed in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Neyman-Rubin potential outcomes framework considers units

xi ∈ X

with an effectively applied treatment

ti ∈ {0, 1}

The two possible potential outcomes are defined by

Y0(xi), Y1(xi) ∈ Y

For one of them (the one which actually happened) we can observe its factual outcome:

y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i)

Let (x_1, t_1, y_1^F), ..., (x_n, t_n, y_n^F) be a sample from the factual distribution.

Consequently, let (x_1, 1 − t_1, y_1^CF), ..., (x_n, 1 − t_n, y_n^CF) be the counterfactual sample.

Notice that all the factual outcomes yF are known, whereas the counterfactual outcomes yCF are never known for any unit (except during the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions yF and yft, referring to factual observed outcomes, will be used interchangeably, while yCF and ycft will denote counterfactual outcomes.
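The factual/counterfactual selection above can be stated directly in code; the numbers below are made-up potential outcomes for four hypothetical units:

```python
import numpy as np

# Potential outcomes for 4 hypothetical units and their assigned treatments.
y0 = np.array([1.0, 2.0, 3.0, 4.0])   # outcome under control   (t = 0)
y1 = np.array([1.5, 2.5, 3.5, 4.5])   # outcome under treatment (t = 1)
t = np.array([0, 1, 0, 1])

# y_F = t * Y1(x) + (1 - t) * Y0(x): the outcome actually observed.
yf = t * y1 + (1 - t) * y0
# Flipping t selects the (normally unobservable) counterfactual outcome.
ycf = (1 - t) * y1 + t * y0
```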


2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x assigned either the treatment t = 1 or t = 0, it is impossible to observe the counterfactual outcome E[Y0 | x, t = 1] or E[Y1 | x, t = 0] (what would have happened, or what the outcome would have been, if the other treatment had been given to the unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y0 | x, t = 0] or E[Y1 | x, t = 1], or in shorter terms Y0 or Y1.

In this dissertation the focus is on the case when the causal graph is simple and known to be of the form (Y1, Y0) ← x → t, with no hidden confounders.

This problem, like most of the problems and applications discussed under the Neyman-Rubin Potential Outcomes Framework in this dissertation, can be extended to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome Ycf is even worse in multi-treatment experiments, since the number of missing values that matter for better Individual Treatment Effects grows with the total number of possible treatments, minus the one applied.

2.0.3 Metrics for Causality

Three well-known metrics in the causality field are reported for each machine learning technique applied.

The losses that will be reported are

• εITE: Error of the Individual Treatment Effect - also known as the Conditional Average Treatment Effect (CATE) - which measures how well or badly the treatment performs on one particular patient.

ITE(x) := E[y | X = x, t = 1] − E[y | X = x, t = 0] = E[Yx1 − Yx0]

• εATE: Error of the Average Treatment Effect; as its name describes, it represents the effect that the applied treatment (either t = 0 or t = 1) effectively had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, since their unique characteristics as a unit might make them experience wrong results or no results at all.

ATE := E[ITE(x)] = E[δ] = E[Y1 − Y0], ∀x ∈ X

• εPEHE: Precision in Estimation of Heterogeneous Effect, used to measure the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or not as accurate, for the other.

PEHE := (1/N) Σ_{i=1}^{N} ((y_i1 − y_i0) − (ŷ_i1 − ŷ_i0))²

where y_i1, y_i0 are the true potential outcomes and ŷ_i1, ŷ_i0 the predicted ones.
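Given the definitions above, εATE and PEHE can be computed directly from true and predicted potential outcomes (possible only because the dataset is semi-synthetic). A minimal sketch; conventions for reporting εITE vary between papers, so only these two quantities are shown:

```python
import numpy as np

def causal_errors(y0, y1, y0_hat, y1_hat):
    """eps_ATE and PEHE from true and predicted potential outcomes."""
    ite_true = y1 - y0          # true per-unit treatment effect
    ite_pred = y1_hat - y0_hat  # predicted per-unit treatment effect
    # eps_ATE: absolute error on the population-average effect.
    eps_ate = np.abs(ite_true.mean() - ite_pred.mean())
    # PEHE as defined above: mean squared error on the per-unit effect.
    pehe = np.mean((ite_true - ite_pred) ** 2)
    return eps_ate, pehe
```

Perfect predictions give zero for both errors, while a constant over-estimate of the effect shifts εATE and PEHE by the same amount, illustrating how the two metrics penalize different failure modes.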


2.0.4 Assumptions

To work on the results, three important assumptions under the Neyman-Rubin causal framework shall be made:

• Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y0 will be the observed, or factual, outcome (yF); if the applied treatment was t = 1, then y = Y1 will be the available observed outcome, or factual, yF.

• Strong Ignorability: also known as no unmeasured confounders, this assumption can be stated as (Y1, Y0) ⫫ t | x and 0 < p(t = 1 | x) < 1, ∀x. It is important to notice that, to be able to state this assumption, a domain knowledge expert has to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: this assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 | x) < 1
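Common support can be probed empirically by estimating propensity scores and checking that they stay strictly inside (0, 1). A sketch on made-up data; the 0.05 margin is an arbitrary illustrative choice, not part of the assumption itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x = rng.normal(size=(500, 5))
# Treatment assignment depends on the first covariate (confounding).
p_true = 1 / (1 + np.exp(-x[:, 0]))
t = rng.binomial(1, p_true)

# Estimated propensity scores e(x) = P(t = 1 | x).
e_hat = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]

# Overlap check: every unit should have 0 < e(x) < 1 by some margin.
eps = 0.05
overlap_ok = bool(np.all((e_hat > eps) & (e_hat < 1 - eps)))
```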

2.0.5 Definitions

In causal inference from observational data several terms are used interchangeablyand might confuse the reader

The terms in this subsection should be clear before going further into this dissertation.

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied; patient, individual, input; xi ∈ X.

• covariates: all the collected (observed) variables that have a direct effect on the outcome; features (ML); x ∈ X.

• treatment: the possible actions that can be applied to a unit, usually binary but possibly multi-valued under the Neyman-Rubin Potential Outcomes Framework; action; t ∈ {0, 1} or t ∈ {0, ..., N}.

• outcome: the measured result of applying a treatment t to a unit x; observed outcome, result, factual, Y factual; yf = yF.

• counterfactual: what the result would have been if the opposite of the effectively applied treatment had been applied to a unit; unobserved outcome; ycft, ycf, YCF.

2.0.6 Related Work

Potential outcomes are the framework to mathematically describe causality and coun-terfactuals (Rubin 1978)


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning it has attracted attention only in the last decade or so (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods model the relationships within feature-action pairs and rewards through one or more parameters, trying to specifically model the relations between context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015) and regression trees (Chipman, George and McCulloch, 2010) have been used in the past. For example, (Wager and Athey, 2017) estimate ITEs with causal forests, but their asymptotic estimates have limitations in datasets with a large number of relevant features that need to be addressed in future work.

In non-parametric approaches, the counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causality is performed by merging parametric and non-parametric methods (Dudik, Langford and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both method families; a common example is propensity score weighted regression (Bang and Robins, 2005; Dudik, Langford and Li, 2011). When the treatment assignment probability is known, this approach models the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational data cases their efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions - as well as the application of other techniques to causality - with special focus on datasets with unbalanced treatment application. This refers to the sub-area of causality known as causal inference from observational data: data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting advances to the scientific community (Shalit, Johansson and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag, 2016a; Shalit, Johansson and Sontag, 2017; Alaa, Weisz and Van Der Schaar, 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization, the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the Individual Treatment Effect has gained increasing interest in the past years; to name a few: (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016) and (Shalit, Johansson and Sontag, 2017). (Johansson, Shalit and Sontag, 2016a) and (Shalit, Johansson and Sontag, 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error loss between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag, 2017) the authors built on (Johansson, Shalit and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work implemented Gaussian processes (Alaa, Weisz and Van Der Schaar, 2017) and decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with several features by performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning features that are relevant for predicting some actions while not taking them into account for others. The relevant feature selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), as in (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and do not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims, 2015a) came up with a Counterfactual Risk Minimization (CRM) method in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald and Pereira, 2006). Additional techniques in policy optimization were developed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection sampling. The algorithm that (Atan, Zame and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work in the DA field was done by (Zhang et al., 2013; Daumé, 2009).


To conclude, in cause and effect analysis, time series data is widely adopted for decision-making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator based on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time series causality task. Later, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning, due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that is set beforehand (Dudik, Langford and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas and Brunskill, 2017).

2.1 Machine Learning

This section describes the machine learning techniques, applied using the scikit-learn open source framework, that were used to perform the experiments.

The vast majority of the methods tested belong to the Generalized Linear Models family, in which the target (or label) value is represented as a linear combination of the covariates (inputs):

y(w, x) = w0 + w1 x1 + ... + wp xp    (2.1)

where the vector w = (w1, ..., wp) represents the coefficients and w0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between theobserved dataset and the predictions made on it

Mathematically it solves the problem of

min_w ||Xw − y||_2^2

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the model produces a high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.
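A minimal example of the scikit-learn interface used throughout the experiments, on synthetic data where the generating coefficients are known (all numbers below are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.01, size=100)

# With independent features and low noise, OLS recovers the generating weights.
ols = LinearRegression().fit(X, y)
print(ols.intercept_, ols.coef_)
```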


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. The loss turns into a problem of minimizing a penalized sum of squares:

min_w ||Xw − y||_2^2 + α ||w||_2^2

It is worth mentioning that the parameter α ≥ 0 controls the amount of robustness to collinearity that the trained model will have.
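The effect of the penalty can be seen on nearly collinear synthetic data, where plain OLS tends to produce huge, unstable coefficients while ridge keeps them small (the data and α = 1.0 below are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=1e-6, size=200)   # nearly collinear copy of x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# OLS splits the weight between the two copies arbitrarily and can blow up;
# the l2 penalty keeps the ridge coefficients small and stable (~0.5 each).
print(np.abs(ols.coef_).max(), np.abs(ridge.coef_).max())
```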

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification to regression problems. During the training phase, the best possible solution is the one that is least penalized in total by a loss function. The support vectors are the inputs that are either mispredicted, predicted within the margin, or that lie on the edge of the generated hyper-plane used for future predictions.

In particular, given training vectors xi ∈ R^p, i = 1, ..., n, and a vector y ∈ R^n, ε-SVR solves the following primal problem:

min_{w,b,ζ,ζ*} (1/2) w^T w + C Σ_{i=1}^{n} (ζi + ζi*)

subject to yi − w^T φ(xi) − b ≤ ε + ζi,
w^T φ(xi) + b − yi ≤ ε + ζi*,
ζi, ζi* ≥ 0, i = 1, ..., n

where e is the vector of all ones, C > 0 is the upper bound, and Q is an n by n positive semidefinite matrix with Qij ≡ K(xi, xj) = φ(xi)^T φ(xj) the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function φ.

The decision function is:

Σ_{i=1}^{n} (αi − αi*) K(xi, x) + ρ
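A small illustrative fit with the scikit-learn defaults (RBF kernel, C = 1.0, ε = 0.1) on a noisy sine curve; the data is made up for the example:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)

# Default scikit-learn SVR: RBF kernel, C=1.0, epsilon=0.1.
svr = SVR().fit(X, y)
y_hat = svr.predict(X)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
```

The ε-insensitive tube means residuals smaller than 0.1 are not penalized, so the training error plateaus near the noise level rather than at zero.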

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than plain Linear Regression.

This technique builds a probabilistic model of the regression problem, with the prior over the parameter w of the general Bayesian Regression solver given by a spherical Gaussian:

p(w | λ) = N(w | 0, λ⁻¹ Ip)

The scikit-learn defaults are used to train the model: α1 = α2 = λ1 = λ2 = 10⁻⁶.

During the model fitting process, the parameters w, α and λ are estimated jointly.
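A short sketch with the defaults mentioned above; the generating coefficients are made up for illustration. The return_std option exposes the predictive uncertainty that distinguishes this model from plain ridge:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = X @ np.array([1.0, 0.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=150)

# Hyper-priors left at the scikit-learn defaults
# (alpha_1 = alpha_2 = lambda_1 = lambda_2 = 1e-6).
br = BayesianRidge().fit(X, y)
mean, std = br.predict(X[:1], return_std=True)  # predictive mean and std
```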


2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

min_w (1 / (2 n_samples)) ||Xw − y||_2^2 + α ||w||_1

This method minimizes the least-squares loss with α||w||_1 added, where α is a constant and ||w||_1 is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm retrieves sparse models, which may be helpful to perform feature selection.
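The sparsity-based feature selection can be illustrated on synthetic data where only two of ten covariates drive the outcome (α = 0.1 is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two covariates actually drive the outcome.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
# The l1 penalty zeroes out irrelevant coefficients: implicit feature selection.
selected = np.flatnonzero(lasso.coef_ != 0)
```

Note the small downward bias the penalty introduces on the surviving coefficients, a known trade-off of the ℓ1 approach.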

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, applying ℓ1 regularization.

The objective function is determined by

(1 / (2 n_samples)) ||y − Xw||_2^2 + α ||w||_1

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w. It drops the assumption that the Gaussian is spherical, making it elliptical:

p(w | λ) = N(w | 0, A⁻¹)

with diag(A) = λ = {λ1, ..., λp}

2.1.8 Passive Aggressive Regressor

Passive Aggressive algorithms are suitable for large-scale learning; they do not require a learning rate, but they do require a regularization parameter C.

The regressor can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (also known as squared epsilon-insensitive).
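Both variants are available in scikit-learn through the loss parameter; the data below is illustrative:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.05, size=500)

# PA-I uses the epsilon-insensitive loss, PA-II its squared version;
# both are controlled by the regularization parameter C.
pa1 = PassiveAggressiveRegressor(loss="epsilon_insensitive",
                                 C=1.0, random_state=0).fit(X, y)
pa2 = PassiveAggressiveRegressor(loss="squared_epsilon_insensitive",
                                 C=1.0, random_state=0).fit(X, y)
```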

2.1.9 Theil Sen Regressor

It is especially suited to data with multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems; in high dimensions this method becomes similar to a Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the k nearest neighbors found during the training phase. It is important to notice that k is defined by the user, and it will affect, positively or negatively, the obtained prediction results.
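A quick illustration of how the user-chosen k drives the bias/variance trade-off (the sine data is made up; training error grows with k here):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=300)

# Small k overfits (low bias, high variance); large k over-smooths
# (high bias, low variance).
for k in (1, 5, 50):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    rmse = np.sqrt(np.mean((knn.predict(X) - y) ** 2))
    print(k, round(rmse, 3))
```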

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, and it can produce predictions for more than one class using the logistic function.

This scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets and the results willbe discussed in the Experiments section


Chapter 3

Methodology

This chapter details what was intended to be achieved. In addition, the methods that were used are explained, together with any other information necessary to help the reader understand the flow of the experiments covered later.

In addition, the dataset that has been used is presented, closing with a section about other possible datasets and the limitations of the ones used in this dissertation.

Finally, a full description of the dataset used to perform the experiments is given.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real lifescenarios are difficult to obtain

On the one hand, the whole point of the current project - and, to some extent, of the recent efforts in machine learning applied to causality - is to make accurate predictions on a set of units, patients, or inputs (in the machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. By its nature, the already collected, observational, data has neither been randomized properly nor drawn from the same probability distribution. Also, the number of units which received the treatment and the number which did not could potentially differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions, since they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, tested against a control treatment - in this case, driving without alcohol consumption - is completely unethical to perform, for clear reasons.

To overcome these limitations when working on causal effects from observational data, synthetic, semi-synthetic or toy datasets are created by researchers in order to establish a good starting point and a benchmark framework to try, test, or develop better algorithms able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions for causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases, a policy risk function π is designed to apply or not the treatment t depending on a certain threshold θ. Minimizing the errors when predicting the application of the active treatment or control is the main goal when iterating over different values of the threshold variable for the trained dataset.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and attendance at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group received only the pediatric follow-up.

However, (Hill, 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the above paragraph. In (Hill, 2011), some continuous and binary covariates from this real-life RCT were selected; making use of them, the author created non-parametric simulated outcomes for the whole population of the trial. In total, 25 covariates of the whole study were taken for this dataset creation. The author then introduced an artificial imbalance between the control and treatment individuals by removing a subset of the treated population. Finally, the dataset comprises 747 subjects (units or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up quite unbalanced, especially for learning and predicting the effects of the treatment t = 0 or t = 1 based on the generalization an algorithm can perform.

Along with the covariates, the simulated causal information can be observed for each unit: the effectively applied treatment (t = 0 or t = 1), the observed outcome (Yft), the counterfactual outcome (Ycft), and the average outcomes with noise, mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017), which constitute the state-of-the-art baseline chosen for comparison in the experiments of the present work.
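A hypothetical loader for these replication files is sketched below; the key names and the (units, covariates, replications) axis layout are assumptions based on the .npz archives distributed alongside (Shalit, Johansson and Sontag, 2017) and may need adjusting to the actual files:

```python
import numpy as np

# Hypothetical loader: key names ('x', 't', 'yf', 'ycf', 'mu0', 'mu1') and
# the (units, covariates, replications) layout are assumed, not confirmed.
def load_replication(path, r):
    d = np.load(path)
    x = d["x"][:, :, r]                        # covariates of every unit
    t = d["t"][:, r]                           # applied treatment
    yf = d["yf"][:, r]                         # observed (factual) outcome
    ycf = d["ycf"][:, r]                       # counterfactual (testing only)
    mu0, mu1 = d["mu0"][:, r], d["mu1"][:, r]  # noiseless average outcomes
    return x, t, yf, ycf, mu0, mu1
```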

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE and PEHE errors, for the reader to look further into them if desired. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b), the IHDP dataset (Hill, 2011) was run on 100 replications for hyperparameter tuning and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation, since the response surface chosen differs from that of the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). For the BART results in (Johansson, Shalit, and Sontag, 2016b), the authors relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill, 2011). However, the authors neither make explicit the number of replications used to gather the metrics nor state whether log-linear response surface A, B, or any other method was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with those of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset was performed to predict the factual yF and counterfactual yCF outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE, and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which amounts to identifying the best possible action, or treatment, for a given unit x with its unique covariates (features).

This is a challenging goal since, as should be clear to the reader by this point, neither the counterfactual outcome yCF nor the noiseless outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these values, along with the factual outcome yF, are only used to compute the εITE, εATE, and εPEHE errors.
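For concreteness, commonly used forms of two of these errors are sketched below. The thesis computes its numbers with the evaluation code of (Louizos et al., 2017), so the exact definitions there (in particular for εITE) may differ slightly from this sketch.

```python
import numpy as np

def error_metrics(ite_pred, mu0, mu1):
    """Absolute ATE error and sqrt(PEHE) against the noiseless outcomes."""
    true_ite = mu1 - mu0
    eps_ate = float(np.abs(ite_pred.mean() - true_ite.mean()))
    sqrt_pehe = float(np.sqrt(np.mean((ite_pred - true_ite) ** 2)))
    return eps_ate, sqrt_pehe

# A perfect per-unit prediction drives both errors to zero.
print(error_metrics(np.full(4, 1.5), np.zeros(4), np.full(4, 1.5)))  # (0.0, 0.0)
```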

The experiments were run on 10, 100, and 1000 replications, both within-sample and out-of-sample. The replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same ones used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017); the tables with those state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify what within-sample and out-of-sample stand for here. The definitions are those given by (Shalit, Johansson, and Sontag, 2017), the same convention later followed by (Louizos et al., 2017) to perform, compare, and present their results.

Within-sample: these are the errors (ITE, ATE, and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples receiving treatments t = 0 and t = 1), in which only one applied treatment and its factual outcome are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data and were mentioned earlier in this work.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. Making predictions is naturally harder in this case, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for every single unit (input) of the testing dataset, and the ITE, ATE, and PEHE errors are then determined.

Once the model is trained, it predicts an outcome for each one of the inputs (units) using the treatment value t = 0; predictions are then made again with all treatment values set to t = 1. The difference between these two predictions for each input is the ITE, and it ultimately defines whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by E[Y1 − Y0 | x].

The machine learning algorithms implemented in Python by scikit-learn (User guide contents — scikit-learn 0.19.2 documentation) were run with their default hyperparameters to obtain the metrics shown in the tables displayed in this chapter. Hyperparameter tuning was done on 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
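The two-prediction procedure just described can be sketched as follows; LinearRegression and the synthetic data (true effect 3.0) are illustrative stand-ins for the regressors and the IHDP replications.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, d = 500, 25
x = rng.normal(size=(n, d))
t = rng.integers(0, 2, size=n)
yf = x[:, 0] + 3.0 * t + rng.normal(scale=0.1, size=n)  # synthetic factual outcome

# One regressor over the covariates plus the treatment as an extra feature.
model = LinearRegression().fit(np.column_stack([x, t]), yf)

# Predict both arms for every unit and subtract:
# ITE_hat(x) = E[Y | x, t=1] - E[Y | x, t=0]
y1_hat = model.predict(np.column_stack([x, np.ones(n)]))
y0_hat = model.predict(np.column_stack([x, np.zeros(n)]))
ite_hat = y1_hat - y0_hat
print(round(ite_hat.mean(), 1))  # ≈ 3.0, the simulated effect
```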

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications of the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the same algorithms on the 10 replications of the IHDP dataset.

TABLE 4.1: IHDP, 10 replications with traditional machine learning algorithms - Within-sample

Method                         | εITE        | εATE        | √εPEHE
Support Vector Regressor (SVR) | 2.62 ± 1.14 | 0.94 ± 0.35 | 2.73 ± 1.23
BayesianRidge                  | 3.90 ± 1.99 | 0.97 ± 0.67 | 4.80 ± 2.80
LassoLars                      | 4.76 ± 1.25 | 4.67 ± 0.57 | 7.40 ± 2.55
Lasso                          | 4.76 ± 1.25 | 4.67 ± 0.57 | 7.40 ± 2.55
ARDRegression                  | 3.92 ± 2.01 | 0.97 ± 0.74 | 4.80 ± 2.81
PassiveAggressiveRegressor     | 4.39 ± 2.09 | 1.54 ± 1.07 | 4.97 ± 2.92
TheilSenRegressor              | 3.93 ± 1.99 | 0.89 ± 0.63 | 4.78 ± 2.79
BaggingRegressor               | 5.14 ± 1.67 | 3.57 ± 0.47 | 6.27 ± 2.31
KNeighboursRegressor           | 5.14 ± 1.67 | 3.57 ± 0.47 | 6.27 ± 2.31
LinearRegression               | 3.92 ± 2.01 | 0.89 ± 0.65 | 4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show results for 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case, it is worth remarking that the train/test split was performed randomly over the training dataset only. The intention was to test whether the results


TABLE 4.2: IHDP, 10 replications with traditional machine learning algorithms - Out-of-sample

Method                         | εITE        | εATE        | √εPEHE
Support Vector Regressor (SVR) | 2.26 ± 0.63 | 1.24 ± 0.62 | 2.34 ± 0.98
BayesianRidge                  | 3.54 ± 1.67 | 1.82 ± 1.35 | 4.13 ± 2.23
LassoLars                      | 4.30 ± 0.89 | 5.48 ± 1.25 | 6.95 ± 2.06
Lasso                          | 4.30 ± 0.89 | 5.48 ± 1.25 | 6.95 ± 2.06
ARDRegression                  | 3.57 ± 1.70 | 1.83 ± 1.41 | 4.14 ± 2.27
PassiveAggressiveRegressor     | 4.19 ± 1.94 | 2.38 ± 1.75 | 4.45 ± 2.49
TheilSenRegressor              | 3.62 ± 1.68 | 1.76 ± 1.30 | 4.08 ± 2.21
BaggingRegressor               | 4.27 ± 1.18 | 3.92 ± 0.95 | 5.63 ± 1.81
KNeighboursRegressor           | 4.27 ± 1.18 | 3.92 ± 0.95 | 5.63 ± 1.81
LinearRegression               | 3.58 ± 1.70 | 1.77 ± 1.33 | 4.09 ± 2.22

obtained in the following experiments, with the datasets already split into train and test as downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP, 100 replications - Within-sample

Method                         | εITE        | εATE        | √εPEHE
Support Vector Regressor (SVR) | 3.17 ± 0.40 | 0.82 ± 0.09 | 3.30 ± 0.42
BayesianRidge                  | 4.49 ± 0.60 | 0.86 ± 0.16 | 5.65 ± 0.83
LassoLars                      | 4.76 ± 0.36 | 4.57 ± 0.17 | 7.90 ± 0.77
Lasso                          | 4.76 ± 0.36 | 4.57 ± 0.17 | 7.90 ± 0.77
ARDRegression                  | 4.49 ± 0.60 | 0.81 ± 0.16 | 5.64 ± 0.83
PassiveAggressiveRegressor     | 5.49 ± 0.75 | 0.83 ± 0.14 | 5.66 ± 0.83
TheilSenRegressor              | 4.45 ± 0.59 | 0.79 ± 0.15 | 5.63 ± 0.83
BaggingRegressor               | 5.35 ± 0.49 | 3.46 ± 0.14 | 6.78 ± 0.70
KNeighboursRegressor           | 5.35 ± 0.49 | 3.46 ± 0.14 | 6.78 ± 0.70
LinearRegression               | 4.53 ± 0.60 | 0.79 ± 0.16 | 5.63 ± 0.83

TABLE 4.4: IHDP, 100 replications - Out-of-sample

Method                         | εITE        | εATE        | √εPEHE
Support Vector Regressor (SVR) | 2.79 ± 0.27 | 0.86 ± 0.13 | 3.25 ± 0.42
BayesianRidge                  | 4.27 ± 0.54 | 1.02 ± 0.26 | 5.37 ± 0.78
LassoLars                      | 4.75 ± 0.39 | 4.51 ± 0.23 | 7.57 ± 0.71
Lasso                          | 4.75 ± 0.39 | 4.51 ± 0.23 | 7.57 ± 0.71
ARDRegression                  | 4.27 ± 0.54 | 1.00 ± 0.26 | 5.36 ± 0.78
PassiveAggressiveRegressor     | 5.28 ± 0.69 | 1.00 ± 0.21 | 5.36 ± 0.77
TheilSenRegressor              | 4.24 ± 0.53 | 0.99 ± 0.25 | 5.35 ± 0.78
BaggingRegressor               | 4.93 ± 0.43 | 3.19 ± 0.18 | 6.23 ± 0.63
KNeighboursRegressor           | 4.93 ± 0.43 | 3.19 ± 0.18 | 6.23 ± 0.63
LinearRegression               | 4.31 ± 0.55 | 0.99 ± 0.26 | 5.36 ± 0.79


Subsequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, hyperparameter tuning, where performed, was done on this number of replications.

TABLE 4.5: IHDP, 100 replications, already split dataset - Within-sample

Method                         | εITE        | εATE        | √εPEHE
Support Vector Regressor (SVR) | 3.05 ± 0.38 | 0.76 ± 0.08 | 3.17 ± 0.40
BayesianRidge                  | 4.44 ± 0.58 | 0.80 ± 0.14 | 5.61 ± 0.83
LassoLars                      | 4.76 ± 0.36 | 4.55 ± 0.17 | 7.88 ± 0.76
Lasso                          | 4.76 ± 0.36 | 4.55 ± 0.17 | 7.88 ± 0.76
ARDRegression                  | 4.45 ± 0.59 | 0.77 ± 0.15 | 5.61 ± 0.83
PassiveAggressiveRegressor     | 5.03 ± 0.62 | 0.83 ± 0.13 | 5.63 ± 0.82
TheilSenRegressor              | 4.40 ± 0.57 | 0.72 ± 0.13 | 5.60 ± 0.82
BaggingRegressor               | 5.31 ± 0.48 | 3.41 ± 0.14 | 6.72 ± 0.69
KNeighboursRegressor           | 5.31 ± 0.48 | 3.41 ± 0.14 | 6.72 ± 0.69
LinearRegression               | 4.48 ± 0.59 | 0.75 ± 0.14 | 5.60 ± 0.82

TABLE 4.6: IHDP, 100 replications, already split dataset - Out-of-sample

Method                         | εITE        | εATE        | √εPEHE
Support Vector Regressor (SVR) | 2.84 ± 0.28 | 0.73 ± 0.07 | 3.49 ± 0.49
BayesianRidge                  | 4.41 ± 0.57 | 0.81 ± 0.11 | 5.74 ± 0.89
LassoLars                      | 4.65 ± 0.34 | 4.31 ± 0.14 | 7.96 ± 0.82
Lasso                          | 4.65 ± 0.34 | 4.31 ± 0.14 | 7.96 ± 0.82
ARDRegression                  | 4.42 ± 0.58 | 0.78 ± 0.11 | 5.73 ± 0.89
PassiveAggressiveRegressor     | 4.95 ± 0.59 | 1.01 ± 0.17 | 5.78 ± 0.89
TheilSenRegressor              | 4.38 ± 0.56 | 0.85 ± 0.13 | 5.74 ± 0.89
BaggingRegressor               | 4.95 ± 0.46 | 2.98 ± 0.10 | 6.65 ± 0.75
KNeighboursRegressor           | 4.95 ± 0.46 | 2.98 ± 0.10 | 6.65 ± 0.75
LinearRegression               | 4.45 ± 0.58 | 0.78 ± 0.11 | 5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting A generated using the code from (Dorie, 2016), was used for both types of measurement.

Four tables are shown for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler from the scikit-learn library. The results improved, not dramatically, but enough to keep the scaling in the final results of the methods presented in the following section.



TABLE 4.7: IHDP, 1000 replications - No scaling - Within-sample

Method                         | εITE        | εATE        | √εPEHE
Support Vector Regressor (SVR) | 3.09 ± 0.12 | 0.69 ± 0.02 | 3.21 ± 0.13
BayesianRidge                  | 4.59 ± 0.19 | 0.78 ± 0.04 | 5.81 ± 0.26
LassoLars                      | 4.65 ± 0.10 | 4.40 ± 0.04 | 7.91 ± 0.24
Lasso                          | 4.65 ± 0.10 | 4.40 ± 0.04 | 7.91 ± 0.24
ARDRegression                  | 4.59 ± 0.19 | 0.76 ± 0.04 | 5.80 ± 0.26
PassiveAggressiveRegressor     | 5.41 ± 0.22 | 0.90 ± 0.05 | 5.85 ± 0.26
TheilSenRegressor              | 4.55 ± 0.18 | 0.70 ± 0.03 | 5.79 ± 0.26
BaggingRegressor               | 5.31 ± 0.15 | 3.28 ± 0.04 | 6.76 ± 0.21
KNeighboursRegressor           | 5.31 ± 0.15 | 3.28 ± 0.04 | 6.76 ± 0.21
LinearRegression               | 4.63 ± 0.19 | 0.73 ± 0.04 | 5.80 ± 0.26

TABLE 4.8: IHDP, 1000 replications - No scaling - Out-of-sample

Method                         | εITE        | εATE        | √εPEHE
Support Vector Regressor (SVR) | 2.81 ± 0.09 | 0.78 ± 0.03 | 3.37 ± 0.14
BayesianRidge                  | 4.57 ± 0.19 | 0.98 ± 0.05 | 5.79 ± 0.26
LassoLars                      | 4.66 ± 0.11 | 4.41 ± 0.05 | 7.90 ± 0.24
Lasso                          | 4.66 ± 0.11 | 4.41 ± 0.05 | 7.90 ± 0.24
ARDRegression                  | 4.58 ± 0.19 | 0.96 ± 0.05 | 5.78 ± 0.26
PassiveAggressiveRegressor     | 5.42 ± 0.22 | 1.13 ± 0.07 | 5.83 ± 0.27
TheilSenRegressor              | 4.54 ± 0.19 | 0.95 ± 0.05 | 5.78 ± 0.26
BaggingRegressor               | 4.95 ± 0.14 | 3.09 ± 0.05 | 6.54 ± 0.22
KNeighboursRegressor           | 4.95 ± 0.14 | 3.09 ± 0.05 | 6.54 ± 0.22
LinearRegression               | 4.61 ± 0.19 | 0.94 ± 0.05 | 5.78 ± 0.26

TABLE 4.9: IHDP, 1000 replications - Scaled - Within-sample

Method                         | εITE        | εATE        | √εPEHE
Support Vector Regressor (SVR) | 2.38 ± 0.08 | 0.33 ± 0.02 | 2.77 ± 0.12
BayesianRidge                  | 4.58 ± 0.19 | 0.73 ± 0.04 | 5.80 ± 0.26
LassoLars                      | 4.65 ± 0.10 | 4.40 ± 0.04 | 7.91 ± 0.24
Lasso                          | 4.65 ± 0.10 | 4.40 ± 0.04 | 7.91 ± 0.24
ARDRegression                  | 4.59 ± 0.19 | 0.76 ± 0.04 | 5.80 ± 0.26
PassiveAggressiveRegressor     | 5.47 ± 0.22 | 1.02 ± 0.06 | 5.88 ± 0.26
TheilSenRegressor              | 4.68 ± 0.19 | 0.69 ± 0.03 | 5.79 ± 0.26
BaggingRegressor               | 4.77 ± 0.13 | 2.67 ± 0.03 | 6.37 ± 0.21
KNeighboursRegressor           | 4.77 ± 0.13 | 2.67 ± 0.03 | 6.37 ± 0.21
RANSACRegressor                | 4.93 ± 0.20 | 1.64 ± 0.09 | 6.09 ± 0.26
HuberRegressor                 | 4.44 ± 0.18 | 0.67 ± 0.03 | 5.79 ± 0.25
ElasticNet                     | 4.65 ± 0.10 | 4.40 ± 0.04 | 7.91 ± 0.24
LinearRegression               | 4.63 ± 0.19 | 0.73 ± 0.04 | 5.80 ± 0.26


TABLE 4.10: IHDP, 1000 replications - Scaled - Out-of-sample

Method                         | εITE        | εATE        | √εPEHE
Support Vector Regressor (SVR) | 2.44 ± 0.08 | 0.45 ± 0.03 | 2.81 ± 0.13
BayesianRidge                  | 4.55 ± 0.19 | 0.95 ± 0.05 | 5.78 ± 0.26
LassoLars                      | 4.66 ± 0.11 | 4.41 ± 0.05 | 7.90 ± 0.24
Lasso                          | 4.66 ± 0.11 | 4.41 ± 0.05 | 7.90 ± 0.24
ARDRegression                  | 4.58 ± 0.19 | 0.96 ± 0.05 | 5.78 ± 0.26
PassiveAggressiveRegressor     | 5.44 ± 0.22 | 1.18 ± 0.07 | 5.87 ± 0.26
TheilSenRegressor              | 4.68 ± 0.19 | 0.95 ± 0.05 | 5.78 ± 0.26
BaggingRegressor               | 4.46 ± 0.13 | 2.33 ± 0.04 | 6.12 ± 0.22
KNeighboursRegressor           | 4.46 ± 0.13 | 2.33 ± 0.04 | 6.12 ± 0.22
RANSACRegressor                | 4.91 ± 0.20 | 1.73 ± 0.09 | 6.06 ± 0.27
HuberRegressor                 | 4.44 ± 0.18 | 0.92 ± 0.05 | 5.77 ± 0.26
ElasticNet                     | 4.66 ± 0.11 | 4.41 ± 0.05 | 7.90 ± 0.24
LinearRegression               | 4.61 ± 0.19 | 0.94 ± 0.05 | 5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is far below that of the regressors, the main reason being that when the continuous target values are encoded into classes to assign them probabilities, these are not the same values that need to be predicted; further precision is lost when decoding the predictions. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
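The encode/predict/decode round trip can be sketched as below. The exact discretisation used in the thesis is not specified, so the quantile binning here is an illustrative assumption; with the newton-cg solver, scikit-learn fits a multinomial model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
x = rng.normal(size=(400, 10))
t = rng.integers(0, 2, size=400)
y = x[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=400)  # continuous target

# Encode the continuous outcome into 10 quantile bins so a classifier applies.
edges = np.quantile(y, np.linspace(0, 1, 11)[1:-1])
y_cls = np.digitize(y, edges)

clf = LogisticRegression(solver="newton-cg", max_iter=1000)
clf.fit(np.column_stack([x, t]), y_cls)

# Decode a predicted class back to a value via its bin mean; precision is
# lost in both directions, which is why the errors in the text degrade.
bin_means = np.array([y[y_cls == k].mean() for k in range(10)])
y_hat = bin_means[clf.predict(np.column_stack([x, t]))]
print(y_hat.shape)  # (400,)
```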

TABLE 4.11: IHDP, 100 replications, logistic regression - Within-sample

Method                              | εITE        | εATE        | √εPEHE
LogisticRegression - L2 (newton-cg) | 7.77 ± 0.76 | 4.40 ± 0.17 | 7.77 ± 0.76
LogisticRegression - L2 (lbfgs)     | 7.77 ± 0.76 | 4.40 ± 0.17 | 7.77 ± 0.76

TABLE 4.12: IHDP, 100 replications, logistic regression - Out-of-sample

Method                              | εITE        | εATE        | √εPEHE
LogisticRegression - L2 (newton-cg) | 5.90 ± 0.57 | 2.41 ± 0.11 | 7.21 ± 0.85
LogisticRegression - L2 (lbfgs)     | 5.90 ± 0.57 | 2.41 ± 0.11 | 7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few hyperparameter tuning runs were performed for it. The errors observed were even smaller, so the final hyperparameters selected for this dataset were a Radial Basis Function (rbf) kernel, C = 1e3, and gamma = 0.01. The selection was performed within-sample and out-of-sample on 100 replications of the dataset; this is the same method that the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use


for their own hyperparameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyperparameter search, with the final selected configuration among them.

TABLE 4.13: IHDP, 100 replications, SVR hyper-parameter tuning - Within-sample

Method                | εITE        | εATE        | √εPEHE
SVR-rbf-1e3-g0.1      | 3.17 ± 0.40 | 0.82 ± 0.09 | 3.30 ± 0.42
SVR-rbf-1e3-g0.05     | 2.71 ± 0.34 | 0.45 ± 0.07 | 2.78 ± 0.36
SVR-rbf-1e3-g0.01     | 2.35 ± 0.29 | 0.24 ± 0.03 | 2.32 ± 0.31
SVR-rbf-1e3-g0.001    | 3.65 ± 0.45 | 0.52 ± 0.09 | 4.51 ± 0.65
SVR-rbf-1e3-g0.0001   | 4.28 ± 0.55 | 0.76 ± 0.11 | 5.61 ± 0.82
SVR-rbf-1e3-g0.00001  | 4.25 ± 0.52 | 1.49 ± 0.10 | 5.97 ± 0.81
SVR-rbf-1e10-g0.1     | 3.17 ± 0.40 | 0.82 ± 0.09 | 3.30 ± 0.42
SVR-rbf-1e20-g0.1     | 3.17 ± 0.40 | 0.82 ± 0.09 | 3.30 ± 0.42
SVR-rbf-1e30-g0.1     | 3.17 ± 0.40 | 0.82 ± 0.09 | 3.30 ± 0.42
SVR-poly-1e3-degree2  | 2.50 ± 0.29 | 0.28 ± 0.03 | 2.53 ± 0.30
SVR-poly-1e3-degree1  | 2.50 ± 0.29 | 0.28 ± 0.03 | 2.53 ± 0.30
SVR-poly-1e3-degree4  | 2.50 ± 0.29 | 0.28 ± 0.03 | 2.53 ± 0.30
SVR-poly-1e10-degree2 | 2.99 ± 0.34 | 0.41 ± 0.06 | 3.00 ± 0.39

TABLE 4.14: IHDP, 100 replications, SVR hyper-parameter tuning - Out-of-sample

Method                | εITE        | εATE        | √εPEHE
SVR-rbf-1e3-g0.1      | 2.79 ± 0.27 | 0.86 ± 0.13 | 3.25 ± 0.42
SVR-rbf-1e3-g0.05     | 2.66 ± 0.24 | 0.53 ± 0.10 | 2.71 ± 0.35
SVR-rbf-1e3-g0.01     | 2.50 ± 0.23 | 0.31 ± 0.05 | 2.26 ± 0.31
SVR-rbf-1e3-g0.001    | 3.45 ± 0.40 | 0.77 ± 0.16 | 4.23 ± 0.62
SVR-rbf-1e3-g0.0001   | 4.09 ± 0.50 | 0.96 ± 0.21 | 5.31 ± 0.77
SVR-rbf-1e3-g0.00001  | 4.05 ± 0.47 | 1.59 ± 0.18 | 5.65 ± 0.75
SVR-rbf-1e10-g0.1     | 2.79 ± 0.27 | 0.86 ± 0.13 | 3.25 ± 0.42
SVR-rbf-1e20-g0.1     | 2.79 ± 0.27 | 0.86 ± 0.13 | 3.25 ± 0.42
SVR-rbf-1e30-g0.1     | 2.79 ± 0.27 | 0.86 ± 0.13 | 3.25 ± 0.42
SVR-poly-1e3-degree2  | 2.87 ± 0.22 | 0.38 ± 0.05 | 2.48 ± 0.29
SVR-poly-1e3-degree1  | 2.87 ± 0.22 | 0.38 ± 0.05 | 2.48 ± 0.29
SVR-poly-1e3-degree4  | 2.87 ± 0.22 | 0.38 ± 0.05 | 2.48 ± 0.29
SVR-poly-1e10-degree2 | 3.21 ± 0.33 | 0.48 ± 0.06 | 2.95 ± 0.39
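The grids in Tables 4.13-4.14 can be reproduced in miniature with GridSearchCV; the synthetic target below stands in for an IHDP replication, and only a subset of the grid is searched to keep the sketch short.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(4)
x = rng.normal(size=(300, 25))
y = np.sin(x[:, 0]) + rng.normal(scale=0.1, size=300)

# rbf kernel with C = 1e3 and a sweep over gamma, as in the tables above.
grid = {"kernel": ["rbf"], "C": [1e3], "gamma": [0.1, 0.05, 0.01, 0.001]}
search = GridSearchCV(SVR(), grid, cv=3)
search.fit(x, y)
print(search.best_params_["kernel"])  # rbf
```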

Finally, the results obtained by this thesis and its experiments are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results obtained in (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method    | √εPEHE     | εATE
OLS/LR-1  | 5.8 ± 0.3  | 0.73 ± 0.04
OLS/LR-2  | 2.4 ± 0.1  | 0.14 ± 0.01
BLR       | 5.8 ± 0.3  | 0.72 ± 0.04
k-NN      | 2.1 ± 0.1  | 0.14 ± 0.01
TMLE      | 5.0 ± 0.2  | 0.30 ± 0.01
BART      | 2.1 ± 0.1  | 0.23 ± 0.01
RAND.FOR. | 4.2 ± 0.2  | 0.73 ± 0.05
CAUS.FOR. | 3.8 ± 0.2  | 0.18 ± 0.01
BNN       | 2.2 ± 0.1  | 0.37 ± 0.03
TARNET    | 0.88 ± 0.0 | 0.26 ± 0.01
CFR MMD   | 0.73 ± 0.0 | 0.30 ± 0.01
CFR WASS  | 0.71 ± 0.0 | 0.25 ± 0.01

Within-sample, IHDP, 1000 replications.

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method    | √εPEHE     | εATE
OLS/LR-1  | 5.8 ± 0.3  | 0.94 ± 0.06
OLS/LR-2  | 2.5 ± 0.1  | 0.31 ± 0.02
BLR       | 5.8 ± 0.3  | 0.93 ± 0.05
k-NN      | 4.1 ± 0.2  | 0.79 ± 0.05
BART      | 2.3 ± 0.1  | 0.34 ± 0.02
RAND.FOR. | 6.6 ± 0.3  | 0.96 ± 0.06
CAUS.FOR. | 3.8 ± 0.2  | 0.40 ± 0.03
BNN       | 2.1 ± 0.1  | 0.42 ± 0.03
TARNET    | 0.95 ± 0.0 | 0.28 ± 0.01
CFR MMD   | 0.78 ± 0.0 | 0.31 ± 0.01
CFR WASS  | 0.76 ± 0.0 | 0.27 ± 0.01

Out-of-sample, IHDP, 1000 replications.


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications on powerful feature selection methods were presented in the Related Work section, an experiment was also performed in the developed code that applies Recursive Feature Elimination (RFE) using the scikit-learn library.

Strictly speaking, under the Strong Ignorability assumption made for the dataset studied, such an experiment would not be appropriate; but given the nature of the machine learning regressors and their sensitivity to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse across all algorithms for the causal inference metrics detailed in this work, so they are not shown; the reader can inspect them in the code implementation for further analysis.
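A minimal version of such an RFE pass over the covariates might look as follows; the choice of 10 retained features and the LinearRegression base estimator are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = rng.normal(size=(200, 25))
y = x[:, 0] + 0.5 * x[:, 1] + rng.normal(scale=0.1, size=200)  # 2 informative features

# Recursively drop the weakest features as ranked by the base estimator.
selector = RFE(LinearRegression(), n_features_to_select=10)
selector.fit(x, y)
x_reduced = selector.transform(x)
print(x_reduced.shape)  # (200, 10)
```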

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for the 10 replications in this work; the code uploaded with this work contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel Dual Core i7.

Domain Adaptation algorithms are a promising direction for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method               | εITE        | εATE        | √εPEHE
DANN (Within-sample) | 1.18 ± 0.17 | 0.12 ± 0.04 | 1.02 ± 0.48
DANN (Out-of-sample) | 1.20 ± 0.11 | 0.17 ± 0.08 | 0.76 ± 0.23

Within-sample and out-of-sample, IHDP, 10 replications.

4.3 Discussion

As can be clearly noticed, the machine learning regressor algorithms applied in this dissertation come very close to the results published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treatment assignment, and no other custom loss function were applied to obtain the results shown in Table 4.9 and Table 4.10.


The cited authors seem to invest an excessive amount of effort in complicated methods that do not gain much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Network on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying machine learning regressors out-of-the-box, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate, custom techniques that constitute the state-of-the-art.

It has to be taken into account that no custom metric functions or special preprocessing steps (except scaling the features, which is part of the must-do tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, although its scope shifted once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal problems on observational data, with more than two possible outcomes, that are substantially better suited to Reinforcement Learning algorithms than to any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future work on this topic should proceed along the following directions.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to multi-valued treatments. To the best of my knowledge, this modification should not be costly to perform in the code; however, at least one new dataset that supports this kind of treatment arity would need to be processed.

Third, applying this method to binary factual predictions with a Policy Risk threshold, determining whether a treatment should be applied or not, would be an important next step in causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on such datasets, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it achieved in the experiment run here.

Lastly, there are causal datasets whose outcomes vary with time and with the treatments applied, framed as time-series problems in the continuous space. These kinds of datasets will possibly be the next focus of researchers applying machine learning to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela Van Der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf.

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822.

Atan, Onur, William R. Zame, and Mihaela Van Der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5.

Atan, Onur, et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353-60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149.

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf.

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399-424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162.

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". In: DOI: 10.1111/j.1541-0420.2005.00377.x.

Ben-David, Shai, et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation.

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044.

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094.

Bottou, Léon, et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207-3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf.

Chernozhukov, Victor, et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060.

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266-298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815.

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci.

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf.

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601.

Gan, Chuang, et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849-866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217-240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela Van Der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596-1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O., et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199-1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803.

Johansson, Fredrik D. (2017 (accessed July 19, 2018)). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www.

Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". In: (). URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523-539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331-339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464-1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571.

Louizos, Christos, et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2 [stat.ML].

Maathuis, Marloes H., et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247-248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: https://www.nature.com/articles/nmeth0410-247.

Mooij, Joris M., et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1-102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1

34 Bibliography

Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Complete Draft Tech rep URLhttpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

ndash (2015b) The Self-Normalized Estimator for Counterfactual LearningTekin Cem and Mihaela Van Der Schaar (2018) Episodic Multi-armed Bandits Tech

rep arXiv arXiv150800641v4 URL httpsarxivorgpdf150800641pdfTian Lu et al (2014) ldquoA Simple Method for Estimating Interactions between a Treat-

ment and a Large Number of Covariatesrdquo In Journal of the American Statistical As-sociation 109508 pp 1517ndash1532 ISSN 0162-1459 DOI 101080016214592014951443 URL httpwwwncbinlmnihgovpubmed25729117httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4338439

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep



5 Conclusions
    5.1 Concluding Remarks
    5.2 Future work

Bibliography


List of Tables

4.1  IHDP 10 replications with traditional machine learning algorithms - Within sample
4.2  IHDP 10 replications with traditional machine learning algorithms - Out-of-sample
4.3  IHDP 100 replications - Within sample
4.4  IHDP 100 replications - Out-of-sample
4.5  IHDP 100 replications, already split dataset - Within sample
4.6  IHDP 100 replications, already split dataset - Out-of-sample
4.7  IHDP 100 replications - No scaling - Within sample
4.8  IHDP 1000 replications - No scaling - Out-of-sample
4.9  IHDP 100 replications - Scaled - Within sample
4.10 IHDP 1000 replications - No scaling - Out-of-sample
4.11 IHDP 100 replications, logistic regressions - Within sample
4.12 IHDP 100 replications, logistic regressions - Out-of-sample
4.13 IHDP 100 replications, SVR hyper-parameter tuning - Within sample
4.14 IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample
4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)
4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)
4.17 Domain Adaptation Neural Networks


List of Abbreviations

ML    Machine Learning
SVR   Support Vector Regressor
RL    Reinforcement Learning
NN    Neural Networks
LR    Linear Regression
KNN   K-Nearest Neighbours
RCE   Randomized Controlled Experiment
ITE   Individual Treatment Effect
ATE   Average Treatment Effect
PEHE  Precision in Estimation of Heterogeneous Effects
CATE  Conditional Average Treatment Effect
RCM   Rubin Causal Model


Chapter 1

Introduction

1.1 Motivation

Causality is often confused with correlation, yet correlation does not imply causation. Such inferences are often called spurious correlations, and they frequently confuse the reasoning process by which humans make decisions.

Even today, the scientific community has not agreed on a common definition of causality.

The proven scientific way to make claims about cause and effect is to perform what is called a Randomized Controlled Trial (RCT). In an RCT, a statistically representative portion of the population participating in the experiment (trial) is exposed to a treatment (action), which can be either positive (applying the treatment) or neutral (control), i.e., giving the patient (unit) a placebo or not treating them at all.

These concepts are expressed in medical terms because the field in which RCTs are applied most is medical trials. However, medicine is not the only industry in which conclusions can be drawn from a trial. RCTs are widely used in social studies, for example, and can also support decisions on buying, selling, or holding a particular stock, or, in the advertising industry, on displaying the advertisement that generates more sales than the others.

Although Randomized Controlled Trials are the best way to detect causal effects, performing such scientifically rigorous experiments is most of the time either impossible or unethical. One example is trying to detect whether driving under the effects of alcohol affects the driver's skills. Another clear example is determining the causes of smoking in teenagers or young people: performing an RCT would involve taking two groups of non-smoking teenagers, making half of the units smoke for several years (depending on the experiment or the research question), and then determining the effects of smoking after that period. As the reader can infer, there are clear ethical problems associated with performing full RCTs to determine causes and effects.

There are also cases in which previously collected data must be analyzed even though the RCT method was never applied. Such data is known as observational data: information obtained from situations in which a formal randomized trial was not conducted, but from which it is still important to try to determine causes and effects. This is the situation in most organizations today: they may have been collecting massive amounts of data over the last decades, yet they could not, or did not, establish an RCT process while collecting the metrics. Moreover, sometimes the data collection process simply cannot take place as a randomized controlled experiment. For example, a rare disease that affects only a small percentage of the population might appear across such a wide range of people that the inference process becomes difficult.

In the past, causal inference methods were a statisticians-only field. However, with recent advances in machine learning algorithms, more computer scientists and machine learning engineers have been trying to infer causes and relationships through traditional and new machine learning techniques.

The ability of some machine learning algorithms to learn complex non-linear relationships has been used to detect and predict policies in which, given the particular features of an individual (patient), the algorithm determines whether or not to apply the treatment (action). This concept is known as Individual Treatment Effect estimation; when the prediction is reduced to a binary decision, to treat or not to treat, through the discovery of a certain threshold, it is referred to as Policy Risk.

Predicting the individual (customized) treatment effect is of great interest because it leads to better decisions (actions or treatments) specifically shaped for each person, rather than relying only on the average over the whole studied population.

The ultimate motivation of this work is to predict Individual Treatment Effects for new patients from previously collected data, using machine learning techniques that are alternatives to the ones used in past research efforts. Moreover, this work attempts to compile a literature review that can be understood by more computer scientists, and the running code of all the experiments performed will be released for others to build upon.

1.2 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE), and the Precision in Estimation of Heterogeneous Effect (PEHE) on a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program). It is a semi-synthetic, particularly unbalanced dataset created for the task of causal inference on observational data (Hill, 2011).

The research question is whether alternative machine learning methods, without adding extra complexity such as custom loss functions or custom metric functions during learning and prediction, can obtain results similar to, or even better than, the state of the art on the exact same benchmark dataset.


1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes Framework (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation are used throughout this thesis. This model, also known as the Neyman-Rubin causal model, is an approach to the statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn Python framework (User guide: contents — scikit-learn 0.19.2 documentation) has been used, with all its underlying methods and default hyper-parameters. The mathematical notation of its documentation is also adopted to describe each algorithm's functions and limitations.

The different algorithms have been tested on a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill, 2011) as a semi-synthetic dataset based on real covariates obtained from an observational study (Gross, 1993). Replications of this dataset (10, 100, and 1000) have been created in order to train the machine learning models, predict with them, and compute the desired error metrics afterwards.

It is important to note that the testing method and the metrics used to determine the effectiveness of each algorithm are different from those normally used to evaluate machine learning algorithms; testing within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset is used, the real Individual Treatment Effect is available for computing test metrics. The experiment results therefore consist of the performance of each algorithm in terms of the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE), and the Precision in Estimation of Heterogeneous Effect (PEHE). These three metrics are displayed in the Experiments chapter for each algorithm trained. A detailed explanation of these formulas can be found in Section 2.0.3.

It is also important to note that the machine learning algorithms are trained using only the applied (observed) treatment, the features (covariates in the causal inference literature), and the observed outcome (usually known as Y or Y factual). After training, a dataset completely unseen during training is used for testing purposes.

The trained algorithm predicts the outcome from the unit's (patient's) features (covariates) twice: once as if the unit had taken the treatment, and likewise as if the unit had taken the control. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE, and PEHE metrics are calculated. In addition, the average score and its deviation over the 10, 100, and 1000 replications of IHDP have been computed to evaluate these errors in bigger simulated scenarios.
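The train-then-predict-both-outcomes loop just described can be sketched as follows. This is a minimal illustration, not the thesis's experimental code: the synthetic data, the variable names, and the choice of Ridge (one of the scikit-learn regressors, used with default hyper-parameters) are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy stand-in for one IHDP replication: covariates X, applied treatment t,
# and the observed (factual) outcome y_f. The true treatment effect is 2.0.
n, d = 200, 5
X = rng.normal(size=(n, d))
t = rng.integers(0, 2, size=n)
y_f = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)

# Train on covariates plus the applied treatment, targeting only y_f.
model = Ridge().fit(np.column_stack([X, t]), y_f)

# Predict both potential outcomes for every unit: once as if treated,
# once as if given control.
y1_hat = model.predict(np.column_stack([X, np.ones(n)]))
y0_hat = model.predict(np.column_stack([X, np.zeros(n)]))

ite_hat = y1_hat - y0_hat   # estimated Individual Treatment Effects
ate_hat = ite_hat.mean()    # estimated Average Treatment Effect
```

Note that with a purely linear model the estimated ITE is the same for every unit; the non-linear regressors listed in the Background chapter can produce unit-specific estimates.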

Mathematical notation is kept to a minimum so as not to confuse the reader with unnecessary information.


1.4 Scope and Limitation

In this document, the outcomes of the treatment applied to a patient are analyzed with respect to only two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments or in the developed code, but both can easily be extended to cover these cases.

A binary treatment is applied, but its outcome value is continuous. This differs from the most commonly used setting of four possible scenarios, in which usually only two can be observed or measured. All the experiments and the developed code can be applied to discrete outputs, although other machine learning techniques (classification algorithms) may be more suitable for that type of prediction. There are also cases in which the task is to predict whether or not to apply the treatment to an individual (also known as a unit or patient). Prediction then turns into a classification task, in which a threshold on the confidence of applying the treatment is set and validated through trial and error against several continuous values to determine which one predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

This case is closer to real-world scenarios, where the data was observed and a final decision on whether or not to apply the treatment (action) has to be made in order to pursue a desired result.

In this work, the cases in which the dataset contains binary outcomes for predicting whether or not to apply the treatment are not covered.


Chapter 2

Background

2.0.1 Rubin-Neyman Causal Model

The Rubin Causal Model (RCM) (Rubin, 2005), also known as the Rubin-Neyman potential outcomes framework, is a statistical analysis framework for modeling observational data developed by Donald Rubin. Rubin built the framework on top of the potential outcomes method that Jerzy Neyman introduced in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Rubin-Neyman potential outcomes framework considers units

x_i ∈ X

with an effectively applied treatment

t_i ∈ {0, 1}.

The two possible potential outcomes are defined by

Y_0(x_i), Y_1(x_i) ∈ Y.

Of these two, only the one which actually happened can be observed; its factual outcome is

y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i).

Let (x_1, t_1, y_1^F), …, (x_n, t_n, y_n^F) be the sample from the factual distribution, and consequently let (x_1, 1 − t_1, y_1^CF), …, (x_n, 1 − t_n, y_n^CF) be the counterfactual sample.

Notice that all the factual outcomes y^F are known, whereas the counterfactual outcome y^CF is never known for any unit (except during the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions y^F and y_ft are used interchangeably to refer to factual (observed) outcomes, while y^CF and y_cft refer to counterfactual outcomes.
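In code, selecting the factual and counterfactual outcomes amounts to a simple masking of the two potential-outcome vectors. The following numpy sketch uses made-up numbers purely for illustration:

```python
import numpy as np

# Synthetic potential outcomes for four units and their applied treatments.
Y0 = np.array([1.0, 2.0, 3.0, 4.0])   # outcome under control,  Y_0(x_i)
Y1 = np.array([1.5, 2.5, 2.0, 6.0])   # outcome under treatment, Y_1(x_i)
t  = np.array([1, 0, 1, 0])           # effectively applied treatment t_i

# y^F = t * Y1 + (1 - t) * Y0 : only these values are ever observed.
y_f  = t * Y1 + (1 - t) * Y0
# The counterfactual sample swaps the treatment indicator.
y_cf = (1 - t) * Y1 + t * Y0

print(y_f)   # [1.5 2.  2.  4. ]
print(y_cf)  # [1.  2.5 3.  6. ]
```

In a real observational dataset only `y_f` is available; `y_cf` can be checked only because the example (like the semi-synthetic IHDP benchmark) generates both potential outcomes.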


2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x assigned either treatment t = 1 or t = 0, it is impossible to observe the counterfactual outcome E[Y_0 ∣ x, t = 1] or E[Y_1 ∣ x, t = 0], i.e., what would have happened, or what the outcome would have been, had the other treatment been given to the unit x.

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y_0 ∣ x, t = 0] or E[Y_1 ∣ x, t = 1], or in shorter terms Y_0 or Y_1.

In this dissertation the focus is on the case where the causal graph is simple and known to be of the form (Y_1, Y_0) ← x → t, with no hidden confounders.

This problem, like most of the problems and applications discussed under the Rubin-Neyman Potential Outcomes Framework in this dissertation, can be extended to multi-treatment experiments. It is important to notice that the problem of not having access to the counterfactual outcome y^CF becomes even worse in that extension, since the number of missing values that matter for better Individual Treatment Effect estimates grows with the number of possible treatments, all except the one applied.

2.0.3 Metrics for Causality

Three well-known metrics from the causality field are reported for each machine learning technique implemented.

The losses that will be reported are:

• ε_ITE: Error of the Individual Treatment Effect, also known as the Conditional Average Treatment Effect (CATE); it measures how well or how badly the treatment performs on one particular patient.

ITE(x) ∶= E[y ∣ X = x, t = 1] − E[y ∣ X = x, t = 0] = E[Y_1(x) − Y_0(x)]

• ε_ATE: Error of the Average Treatment Effect; as its name describes, it represents the effect that the applied treatment (either t = 0 or t = 1) effectively had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, whose unique characteristics as a unit might make them experience wrong results or no results at all.

ATE ∶= E_{x∈X}[ITE(x)] = E[δ] = E[Y_1 − Y_0]

• ε_PEHE: the Precision in Estimation of Heterogeneous Effect measures the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. This metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or less accurate, for the other.

PEHE ∶= (1/N) Σ_{i=1}^{N} ((y_{i1} − y_{i0}) − (ŷ_{i1} − ŷ_{i0}))²

where ŷ denotes a predicted outcome.
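Because the benchmark is semi-synthetic, both true potential outcomes are available at test time, so these metrics can be computed directly. The sketch below is illustrative rather than the thesis's evaluation code: the function name is hypothetical and the mean absolute difference used for the ITE error is an assumption, while the PEHE expression follows the definition given in this section.

```python
import numpy as np

def causal_metrics(y0, y1, y0_hat, y1_hat):
    """Return (eps_ITE, eps_ATE, eps_PEHE) from true (y0, y1) and
    predicted (y0_hat, y1_hat) potential outcomes, all shape (N,)."""
    ite_true = y1 - y0
    ite_pred = y1_hat - y0_hat
    eps_ite = np.mean(np.abs(ite_true - ite_pred))       # mean ITE error (an assumed aggregation)
    eps_ate = np.abs(ite_true.mean() - ite_pred.mean())  # |ATE - ATE_hat|
    eps_pehe = np.mean((ite_true - ite_pred) ** 2)       # PEHE as defined in this section
    return eps_ite, eps_ate, eps_pehe

# Tiny worked example: the prediction is perfect on the second unit only.
y0, y1 = np.array([1.0, 2.0]), np.array([3.0, 2.5])
y0_hat, y1_hat = np.array([1.0, 2.0]), np.array([2.0, 2.5])
print(causal_metrics(y0, y1, y0_hat, y1_hat))  # (0.5, 0.5, 0.5)
```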


2.0.4 Assumptions

To work on the results, three important assumptions under the Rubin-Neyman causal framework must be made:

• Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y_0 is the observed (factual) outcome y^F; if the applied treatment was t = 1, then y = Y_1 is the available observed (factual) outcome y^F.

• Strong Ignorability: also known as no unmeasured confounders, this assumption is stated as (Y_1, Y_0) ⫫ t ∣ x and 0 < p(t = 1 ∣ x) < 1, ∀x. Notice that to assert this assumption, a domain knowledge expert has to assess the dataset and determine that there are no unmeasured confounders; that is the case for the dataset used in this work.

• Common Support: this assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 ∣ x) < 1.

2.0.5 Definitions

In causal inference from observational data, several terms are used interchangeably and might confuse the reader; they should be clear before going further into this dissertation.

Some common synonyms are:

• unit: the subject of the analysis, the one to which the treatment will be applied; patient, individual, input; x_i ∈ X.

• covariates: all the collected (observed) variables that have a direct effect on the outcome; features (ML); x ∈ X.

• treatment: the possible actions that can be applied to a unit, usually binary but possibly multi-valued under the Rubin-Neyman Potential Outcomes Framework; action; t ∈ {0, 1} or t ∈ {0, …, N}.

• outcome: the measured result of applying a treatment t to a unit x; observed outcome, result, factual, Y factual; y^F.

• counterfactual: what the result would have been had the opposite of the effectively applied treatment been applied to the unit; unobserved outcome; y^CF.

2.0.6 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin, 1978).


Causality from observational data is clearly applicable to a wide range of industries, e.g., advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology, and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning it has attracted attention for little more than a decade (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric, and doubly robust methods.

Parametric methods infer the causal relationships among features, action pairs, and rewards by fitting one or more parameters that explicitly model the relations between context, outcomes, and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015), and regression trees (Chipman, George, and McCulloch, 2010) have been used in the past. For example, (Wager and Athey, 2017) estimate ITEs with causal forests, but their asymptotic estimates have limitations on datasets with a large number of relevant features that need to be addressed in future work.

In non-parametric approaches, the counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causal inference is performed by merging parametric and non-parametric methods (Dudik, Langford, and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both families; a common example is propensity-score-weighted regression (Bang and Robins, 2005; Dudik, Langford, and Li, 2011). When the treatment assignment probability is known, these methods model the problem particularly well, e.g., in off-policy evaluation or learning from bandits. However, in most observational data settings their efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years through the development of custom metric functions, as well as the application of other techniques to causality, with special focus on datasets with unbalanced treatment assignment. This refers to the sub-area known as causal inference from observational data, where observational data is data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting advances to the scientific community (Shalit, Johansson, and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit, and Sontag, 2016a; Shalit, Johansson, and Sontag, 2017; Alaa, Weisz, and Van Der Schaar, 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization, the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that minimizes the risk of predicting the action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few, (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016), and (Shalit, Johansson, and Sontag, 2017). (Johansson, Shalit, and Sontag, 2016a) and (Shalit, Johansson, and Sontag, 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the loss between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson, and Sontag, 2017) the authors built on (Johansson, Shalit, and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work has implemented Gaussian processes (Alaa, Weisz, and Van Der Schaar, 2017) and decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with several features by performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning the features that are relevant for predicting some actions while not taking them into account for others. The relevant-feature-selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), likewise in (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and they do not identify individually important features for the task.

In terms of policy optimization methods, (Swaminathan and Joachims 2015a) proposed a Counterfactual Risk Minimization (CRM) method, in which they minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar 2018) proposed to address the selection bias by learning representations, working closely related to the field of domain adaptation bounds in (Ben-David et al. 2007; Blitzer, McDonald and Pereira 2006). Additional policy optimization techniques were developed by (Beygelzimer and Langford 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame and Van Der Schaar 2018) introduce is based on domain adaptation (DA), as in (Gan et al. 2016). More work in the DA techniques field was done by (Zhang et al. 2013; Daumé 2009).

10 Chapter 2 Background

To conclude, in cause-and-effect analysis, time series data is widely adopted for decision-making support. The main challenge in the continuous-time setting is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins 1986) was the first to learn and optimize decisions over time while accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al. 2012), algorithms can learn rules to make decisions along time-steps. An estimator based on structural nested models was introduced by (Lok 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time-series causality task. Later on, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that is fixed beforehand (Dudik, Langford and Li 2011; Swaminathan and Joachims 2015a; Jiang and Li 2015; Paduraru et al. 2012; Doroudi, Thomas and Brunskill 2017).

2.1 Machine Learning

In this section, the machine learning techniques applied to perform the experiments, implemented with the scikit-learn open-source framework, are described.

The vast majority of the methods tested belong to the family of Generalized Linear Models, in which the target (label) value is represented as a linear combination of the covariates (inputs):

ŷ(w, x) = w_0 + w_1 x_1 + ... + w_p x_p    (2.1)

where the vector w = (w_1, ..., w_p) holds the coefficients and w_0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model, the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it.

Mathematically, it solves a problem of the form:

    min_w ||Xw − y||_2^2

The main limitation of this method is that if the features (covariates) have an approximately linear dependence, the model exhibits high variance and is therefore more sensitive to random errors in the predictions. This limitation especially affects data collected without a previously shaped experimental design.
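As a minimal usage sketch (not the thesis's own code), an ordinary least squares fit with scikit-learn on invented toy data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 1 + 2*x1 - 3*x2 plus small noise
rng = np.random.RandomState(0)
X = rng.rand(100, 2)
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.01 * rng.randn(100)

ols = LinearRegression().fit(X, y)
# The recovered intercept and coefficients should be close to (1, [2, -3])
print(ols.intercept_, ols.coef_)
```

Because the noise is small and the features are independent here, the recovered parameters approximate the data-generating ones; under the near-collinearity discussed above, they would not.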


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the Linear Regression method described above by penalizing the size of the coefficients. The loss then becomes a penalized sum-of-squares minimization problem:

    min_w ||Xw − y||_2^2 + α ||w||_2^2

It is worth mentioning that the parameter α ≥ 0 controls the amount of shrinkage, and hence the robustness to collinearity that the trained model will have.
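A short sketch, on invented near-collinear data, of how the α penalty shrinks the coefficient vector (the ℓ2-norm of the ridge solution never grows as α increases):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Two nearly collinear features, a situation where plain OLS is unstable
rng = np.random.RandomState(0)
x1 = rng.rand(100)
X = np.column_stack([x1, x1 + 1e-3 * rng.randn(100)])
y = X.sum(axis=1) + 0.01 * rng.randn(100)

weak = Ridge(alpha=1e-6).fit(X, y)
strong = Ridge(alpha=1.0).fit(X, y)
# Increasing alpha shrinks the coefficient vector, trading variance for bias
print(np.linalg.norm(weak.coef_), np.linalg.norm(strong.coef_))
```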

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification to regression problems. During the training phase, the best possible solution is the one penalized least in total by a loss function. The support vectors are the inputs that are either mispredicted, predicted just within the margin, or lying on the edge of the generated hyper-plane that is used for future predictions.

In particular, an SVR takes training vectors x_i ∈ R^p, i = 1, ..., n, and a target vector y ∈ R^n; ε-SVR solves the following primal problem:

    min_{w, b, ζ, ζ*}  (1/2) w^T w + C Σ_{i=1}^{n} (ζ_i + ζ*_i)

    subject to  y_i − (w^T φ(x_i) + b) ≤ ε + ζ_i,
                (w^T φ(x_i) + b) − y_i ≤ ε + ζ*_i,
                ζ_i, ζ*_i ≥ 0,  i = 1, ..., n

In the corresponding dual problem, e is the vector of all ones, C > 0 is the upper bound, and Q is an n × n positive semidefinite matrix with Q_ij ≡ K(x_i, x_j) = φ(x_i)^T φ(x_j), the kernel. Here training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is:

    Σ_{i=1}^{n} (α_i − α*_i) K(x_i, x) + ρ
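A minimal scikit-learn usage sketch on synthetic one-dimensional data (the C, gamma and epsilon values here are illustrative only, not the ones tuned later in Chapter 4):

```python
import numpy as np
from sklearn.svm import SVR

# Noisy samples of a sine curve
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# RBF-kernel epsilon-SVR; points predicted outside the epsilon tube become support vectors
svr = SVR(kernel="rbf", C=100.0, gamma=0.5, epsilon=0.1).fit(X, y)
print(svr.predict([[1.5]]))  # close to sin(1.5)
```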

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than plain Linear Regression.

This technique builds a probabilistic model of the regression problem, in which the parameter w of the general Bayesian Regression solver is given a spherical Gaussian prior:

    p(w|λ) = N(w | 0, λ^{-1} I_p)

The scikit-learn defaults are used to train the model: α_1 = α_2 = λ_1 = λ_2 = 10^{-6}.

During model fitting, the parameters w, α and λ are estimated jointly.


2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

    min_w (1 / (2 n_samples)) ||Xw − y||_2^2 + α ||w||_1

This method solves the least-squares minimization with α ||w||_1 added, where α is a constant and ||w||_1 is the ℓ1-norm of the parameter vector.

It is important to note that this algorithm yields sparse models, which may be helpful for performing feature selection.
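A small sketch of this sparsity on invented data, where only two of ten features carry signal:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Ten features, but only features 0 and 2 actually drive the target
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.1).fit(X, y)
support = np.flatnonzero(lasso.coef_)
print(support)  # the l1 penalty typically zeroes out the irrelevant coefficients
```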

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with L1 regularization applied.

The objective function is:

    min_w (1 / (2 n_samples)) ||y − Xw||_2^2 + α ||w||_1

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w. It drops the assumption that the Gaussian is spherical, making it elliptical instead. Mathematically:

    p(w|λ) = N(w | 0, A^{-1}),  with diag(A) = λ = {λ_1, ..., λ_p}

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, these models do not require a learning rate, but they do require a regularization parameter C.

They can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (squared epsilon-insensitive).

2.1.9 Theil Sen Regressor

It is especially suited to handling multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems. In high dimensions the method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm, the target is predicted from the n nearest neighbors found during the training phase. Note that n is defined by the user and will affect, positively or negatively, the quality of the obtained predictions.

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can also handle predictions over more than one class using the logistic function.

This scikit-learn implementation can fit binary One-vs-Rest or multinomial logisticregression with optional L2 or L1 regularization

Several solvers and regularizations were applied to the datasets and the results willbe discussed in the Experiments section


Chapter 3

Methodology

This chapter details what this work set out to achieve. The methods used are explained, along with any other information needed to help the reader follow the flow of the experiments covered later.

In addition, the dataset that was used is presented, closing with a section about other applicable datasets and possible limitations of the ones used in this dissertation.

Finally, a full description of the dataset used to perform the experiments is given.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project - and, to some extent, of recent efforts in machine learning applied to causality - is to make accurate predictions on a set of units, patients or inputs (in machine learning vocabulary) that was collected without the chance of previously designing a carefully planned Randomized Controlled Trial. By its nature, such already-collected, observational data has not been randomized properly, nor does it come from a single probability distribution. Moreover, the number of units that received the treatment and the number that did not can differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, would be completely unethical for clear reasons.

To overcome these limitations when working on causal effects in observational data, synthetic, semi-synthetic or toy datasets are created by researchers in order to establish a good starting point and a benchmark framework to try, test or develop better algorithms that make more accurate predictions, surpassing state-of-the-art results.


Lastly, it is important to note that there are two different kinds of predictions in causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In such cases, a policy risk function π is designed to apply, or not, the treatment t depending on a certain threshold θ. The goal when iterating over different values of the threshold on the trained dataset is to minimize the errors made when deciding between applying the active treatment or the control.
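That thresholded decision rule can be sketched as follows; the function name and the predicted effects below are hypothetical:

```python
import numpy as np

def policy(predicted_ite, theta=0.0):
    """Treat (1) exactly those units whose predicted individual effect exceeds theta."""
    return (np.asarray(predicted_ite) > theta).astype(int)

# Hypothetical predicted effects for five units
ite_hat = np.array([-0.4, 0.1, 0.8, -0.05, 0.3])
print(policy(ite_hat, theta=0.0))   # -> [0 1 1 0 1]
print(policy(ite_hat, theta=0.25))  # a stricter threshold treats fewer units
```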

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment conditions to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received visits to their homes and integration into a dedicated child development center, in addition to a pediatric follow-up - in short, high-quality child care. On the other hand, the control group received only the pediatric follow-up.

However, (Hill 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross 1993) mentioned in the above paragraph. In (Hill 2011), some continuous and binary covariates from this real-life RCT were selected; making use of these covariates, the author created simulated, non-parametrically generated outcomes for the whole population of the trial. In total, 25 covariates of the whole study were kept for the dataset creation. The author then introduced an artificial imbalance between control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units, or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly seen, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 and t = 1 in the generalization task that an algorithm must perform.

Along with the covariates, the simulated causal information can be observed for each unit: the treatment effectively applied (t = 0 or t = 1), the observed factual outcome (y_F), the counterfactual outcome (y_CF), and the average outcomes mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill 2011) dataset were used for evaluation and hyper-parameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie 2016). The 100 and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson and Sontag 2017; Louizos et al. 2017), which constitute the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analysing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles that evaluate ITE, ATE and PEHE errors, should the reader wish to look into them further. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit and Sontag 2016b), the IHDP dataset (Hill 2011) is run on 100 replications for hyper-parameter tuning and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation, since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al. 2017; Shalit, Johansson and Sontag 2017). To produce the BART results in (Johansson, Shalit and Sontag 2016b), the authors relied on Bayesian Additive Regression Trees (Chipman, George and McCulloch 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill 2011). However, the authors make explicit neither the number of replications used to gather the metrics nor whether a log-linear "A", "B" or any other setting was used to simulate the semi-synthetic dataset. Consequently, the results obtained by them cannot be compared with this dissertation's results and are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset was performed to ultimately predict the factual (y_F) and counterfactual (y_CF) outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al. 2017), in which the ε_ITE, ε_ATE and ε_PEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which amounts to identifying the best possible action, or treatment, for a given unit x with its unique covariates (features).

This is a challenging goal since, as should be clear to the reader by this point, neither the counterfactual outcome y_CF nor the average treatment outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these three values, along with the factual outcome y_F, are only used to compute the ε_ITE, ε_ATE and ε_PEHE errors.
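The evaluation errors can be sketched from their usual definitions as follows; this mirrors the common formulation, not the exact evaluation code of (Louizos et al. 2017):

```python
import numpy as np

def eval_errors(ite_hat, mu0, mu1):
    """Return (eps_ATE, sqrt of eps_PEHE) for predicted per-unit effects ite_hat."""
    true_ite = mu1 - mu0
    eps_ate = np.abs(np.mean(ite_hat) - np.mean(true_ite))
    pehe = np.sqrt(np.mean((ite_hat - true_ite) ** 2))
    return eps_ate, pehe

mu0 = np.array([1.0, 1.0, 1.0])
mu1 = np.array([3.0, 2.0, 4.0])          # true effects: [2, 1, 3]
ite_hat = np.array([2.5, 0.5, 3.0])      # per-unit errors: [0.5, -0.5, 0]
eps_ate, pehe = eval_errors(ite_hat, mu0, mu1)
print(eps_ate)  # 0.0: the per-unit errors cancel on average
print(pehe)     # sqrt(1/6), about 0.408
```

Note how ε_ATE can be zero while the PEHE is not: averaging hides per-unit mistakes, which is why ITE-oriented metrics matter here.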

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The 10, 100 and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the same ones used to produce the results in (Shalit, Johansson and Sontag 2017; Louizos et al. 2017), whose tables of state-of-the-art errors are also displayed in this section so that the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson and Sontag 2017), the same methodology later followed by (Louizos et al. 2017) to perform, compare and present their results.

Within-sample: this refers to all the errors (ITE, ATE and PEHE) made by the trained model's predictions on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples receiving treatments t = 0 and t = 1), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. These are common problems of observational data, and they were mentioned earlier in this work.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. Making predictions is naturally harder in this case, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts an outcome for each one of the inputs (units) using the treatment value t = 0; predictions are then made again setting all treatment values to t = 1. The difference between these two predictions for each input is the estimated ITE, and it ultimately defines whether the patient would benefit from applying the treatment. Mathematically, it is represented by E[Y_1 − Y_0 | x].
The machine learning algorithms implemented in Python programming code with (User guide contents — scikit-learn 0.19.2 documentation) were run with the default hyper-parameters to obtain the above-mentioned metrics, which are finally shown in the tables displayed in this chapter. Hyper-parameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
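The two-prediction procedure just described can be sketched with any scikit-learn regressor trained on the covariates plus the treatment indicator; the data-generating process below is invented for illustration, with a constant true effect of 2.0:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n = 300
X = rng.rand(n, 5)                          # covariates
t = rng.binomial(1, 0.5, size=n)            # observed treatment assignment
y = X[:, 0] - 2.0 * X[:, 2] + 2.0 * t + 0.01 * rng.randn(n)  # factual outcome

# One regressor over [covariates, treatment]
model = LinearRegression().fit(np.column_stack([X, t]), y)

# Predict every unit twice: once forcing t=0, once forcing t=1
y0 = model.predict(np.column_stack([X, np.zeros(n)]))
y1 = model.predict(np.column_stack([X, np.ones(n)]))
ite_hat = y1 - y0
print(ite_hat.mean())  # recovers the simulated effect of ~2.0
```

A linear model can only express a constant effect; capturing effects that vary with x requires a regressor that can model the interaction between t and the covariates.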

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                    ε_ITE           ε_ATE           √ε_PEHE
Support Vector Regressor (SVR)      2.62 ± 1.14     0.94 ± 0.35     2.73 ± 1.23
BayesianRidge                       3.90 ± 1.99     0.97 ± 0.67     4.80 ± 2.80
LassoLars                           4.76 ± 1.25     4.67 ± 0.57     7.40 ± 2.55
Lasso                               4.76 ± 1.25     4.67 ± 0.57     7.40 ± 2.55
ARDRegression                       3.92 ± 2.01     0.97 ± 0.74     4.80 ± 2.81
PassiveAggressiveRegressor          4.39 ± 2.09     1.54 ± 1.07     4.97 ± 2.92
TheilSenRegressor                   3.93 ± 1.99     0.89 ± 0.63     4.78 ± 2.79
BaggingRegressor                    5.14 ± 1.67     3.57 ± 0.47     6.27 ± 2.31
KNeighboursRegressor                5.14 ± 1.67     3.57 ± 0.47     6.27 ± 2.31
LinearRegression                    3.92 ± 2.01     0.89 ± 0.65     4.79 ± 2.79

In the next experiment, shown in Tables 4.3 and 4.4, 100 replications of the IHDP dataset were taken into account, within-sample and out-of-sample respectively. In this case it is worth noting that the split between training and testing was performed randomly over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                    ε_ITE           ε_ATE           √ε_PEHE
Support Vector Regressor (SVR)      2.26 ± 0.63     1.24 ± 0.62     2.34 ± 0.98
BayesianRidge                       3.54 ± 1.67     1.82 ± 1.35     4.13 ± 2.23
LassoLars                           4.30 ± 0.89     5.48 ± 1.25     6.95 ± 2.06
Lasso                               4.30 ± 0.89     5.48 ± 1.25     6.95 ± 2.06
ARDRegression                       3.57 ± 1.70     1.83 ± 1.41     4.14 ± 2.27
PassiveAggressiveRegressor          4.19 ± 1.94     2.38 ± 1.75     4.45 ± 2.49
TheilSenRegressor                   3.62 ± 1.68     1.76 ± 1.30     4.08 ± 2.21
BaggingRegressor                    4.27 ± 1.18     3.92 ± 0.95     5.63 ± 1.81
KNeighboursRegressor                4.27 ± 1.18     3.92 ± 0.95     5.63 ± 1.81
LinearRegression                    3.58 ± 1.70     1.77 ± 1.33     4.09 ± 2.22

obtained in the following experiments - with the datasets already split into train and test as downloaded from (Johansson 2017 (accessed July 19, 2018)) - are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                    ε_ITE           ε_ATE           √ε_PEHE
Support Vector Regressor (SVR)      3.17 ± 0.40     0.82 ± 0.09     3.30 ± 0.42
BayesianRidge                       4.49 ± 0.60     0.86 ± 0.16     5.65 ± 0.83
LassoLars                           4.76 ± 0.36     4.57 ± 0.17     7.90 ± 0.77
Lasso                               4.76 ± 0.36     4.57 ± 0.17     7.90 ± 0.77
ARDRegression                       4.49 ± 0.60     0.81 ± 0.16     5.64 ± 0.83
PassiveAggressiveRegressor          5.49 ± 0.75     0.83 ± 0.14     5.66 ± 0.83
TheilSenRegressor                   4.45 ± 0.59     0.79 ± 0.15     5.63 ± 0.83
BaggingRegressor                    5.35 ± 0.49     3.46 ± 0.14     6.78 ± 0.70
KNeighboursRegressor                5.35 ± 0.49     3.46 ± 0.14     6.78 ± 0.70
LinearRegression                    4.53 ± 0.60     0.79 ± 0.16     5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                    ε_ITE           ε_ATE           √ε_PEHE
Support Vector Regressor (SVR)      2.79 ± 0.27     0.86 ± 0.13     3.25 ± 0.42
BayesianRidge                       4.27 ± 0.54     1.02 ± 0.26     5.37 ± 0.78
LassoLars                           4.75 ± 0.39     4.51 ± 0.23     7.57 ± 0.71
Lasso                               4.75 ± 0.39     4.51 ± 0.23     7.57 ± 0.71
ARDRegression                       4.27 ± 0.54     1.00 ± 0.26     5.36 ± 0.78
PassiveAggressiveRegressor          5.28 ± 0.69     1.00 ± 0.21     5.36 ± 0.77
TheilSenRegressor                   4.24 ± 0.53     0.99 ± 0.25     5.35 ± 0.78
BaggingRegressor                    4.93 ± 0.43     3.19 ± 0.18     6.23 ± 0.63
KNeighboursRegressor                4.93 ± 0.43     3.19 ± 0.18     6.23 ± 0.63
LinearRegression                    4.31 ± 0.55     0.99 ± 0.26     5.36 ± 0.79


Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al. 2017; Shalit, Johansson and Sontag 2017). As mentioned in this thesis, hyper-parameter tuning, where performed, was done on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

                                    ε_ITE           ε_ATE           √ε_PEHE
Support Vector Regressor (SVR)      3.05 ± 0.38     0.76 ± 0.08     3.17 ± 0.40
BayesianRidge                       4.44 ± 0.58     0.80 ± 0.14     5.61 ± 0.83
LassoLars                           4.76 ± 0.36     4.55 ± 0.17     7.88 ± 0.76
Lasso                               4.76 ± 0.36     4.55 ± 0.17     7.88 ± 0.76
ARDRegression                       4.45 ± 0.59     0.77 ± 0.15     5.61 ± 0.83
PassiveAggressiveRegressor          5.03 ± 0.62     0.83 ± 0.13     5.63 ± 0.82
TheilSenRegressor                   4.40 ± 0.57     0.72 ± 0.13     5.60 ± 0.82
BaggingRegressor                    5.31 ± 0.48     3.41 ± 0.14     6.72 ± 0.69
KNeighboursRegressor                5.31 ± 0.48     3.41 ± 0.14     6.72 ± 0.69
LinearRegression                    4.48 ± 0.59     0.75 ± 0.14     5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

                                    ε_ITE           ε_ATE           √ε_PEHE
Support Vector Regressor (SVR)      2.84 ± 0.28     0.73 ± 0.07     3.49 ± 0.49
BayesianRidge                       4.41 ± 0.57     0.81 ± 0.11     5.74 ± 0.89
LassoLars                           4.65 ± 0.34     4.31 ± 0.14     7.96 ± 0.82
Lasso                               4.65 ± 0.34     4.31 ± 0.14     7.96 ± 0.82
ARDRegression                       4.42 ± 0.58     0.78 ± 0.11     5.73 ± 0.89
PassiveAggressiveRegressor          4.95 ± 0.59     1.01 ± 0.17     5.78 ± 0.89
TheilSenRegressor                   4.38 ± 0.56     0.85 ± 0.13     5.74 ± 0.89
BaggingRegressor                    4.95 ± 0.46     2.98 ± 0.10     6.65 ± 0.75
KNeighboursRegressor                4.95 ± 0.46     2.98 ± 0.10     6.65 ± 0.75
LinearRegression                    4.45 ± 0.58     0.78 ± 0.11     5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson and Sontag 2017; Louizos et al. 2017). The same semi-synthetic IHDP dataset by (Hill 2011), with log-linear response setting "A" generated using the code from (Dorie 2016), was used for both types of measurement.

Four different tables are shown for the 1000 replications, in pairs: two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaled versions as the final results presented for the methods in the following section.



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                    ε_ITE           ε_ATE           √ε_PEHE
Support Vector Regressor (SVR)      3.09 ± 0.12     0.69 ± 0.02     3.21 ± 0.13
BayesianRidge                       4.59 ± 0.19     0.78 ± 0.04     5.81 ± 0.26
LassoLars                           4.65 ± 0.10     4.40 ± 0.04     7.91 ± 0.24
Lasso                               4.65 ± 0.10     4.40 ± 0.04     7.91 ± 0.24
ARDRegression                       4.59 ± 0.19     0.76 ± 0.04     5.80 ± 0.26
PassiveAggressiveRegressor          5.41 ± 0.22     0.90 ± 0.05     5.85 ± 0.26
TheilSenRegressor                   4.55 ± 0.18     0.70 ± 0.03     5.79 ± 0.26
BaggingRegressor                    5.31 ± 0.15     3.28 ± 0.04     6.76 ± 0.21
KNeighboursRegressor                5.31 ± 0.15     3.28 ± 0.04     6.76 ± 0.21
LinearRegression                    4.63 ± 0.19     0.73 ± 0.04     5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                    ε_ITE           ε_ATE           √ε_PEHE
Support Vector Regressor (SVR)      2.81 ± 0.09     0.78 ± 0.03     3.37 ± 0.14
BayesianRidge                       4.57 ± 0.19     0.98 ± 0.05     5.79 ± 0.26
LassoLars                           4.66 ± 0.11     4.41 ± 0.05     7.90 ± 0.24
Lasso                               4.66 ± 0.11     4.41 ± 0.05     7.90 ± 0.24
ARDRegression                       4.58 ± 0.19     0.96 ± 0.05     5.78 ± 0.26
PassiveAggressiveRegressor          5.42 ± 0.22     1.13 ± 0.07     5.83 ± 0.27
TheilSenRegressor                   4.54 ± 0.19     0.95 ± 0.05     5.78 ± 0.26
BaggingRegressor                    4.95 ± 0.14     3.09 ± 0.05     6.54 ± 0.22
KNeighboursRegressor                4.95 ± 0.14     3.09 ± 0.05     6.54 ± 0.22
LinearRegression                    4.61 ± 0.19     0.94 ± 0.05     5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                    ε_ITE           ε_ATE           √ε_PEHE
Support Vector Regressor (SVR)      2.38 ± 0.08     0.33 ± 0.02     2.77 ± 0.12
BayesianRidge                       4.58 ± 0.19     0.73 ± 0.04     5.80 ± 0.26
LassoLars                           4.65 ± 0.10     4.40 ± 0.04     7.91 ± 0.24
Lasso                               4.65 ± 0.10     4.40 ± 0.04     7.91 ± 0.24
ARDRegression                       4.59 ± 0.19     0.76 ± 0.04     5.80 ± 0.26
PassiveAggressiveRegressor          5.47 ± 0.22     1.02 ± 0.06     5.88 ± 0.26
TheilSenRegressor                   4.68 ± 0.19     0.69 ± 0.03     5.79 ± 0.26
BaggingRegressor                    4.77 ± 0.13     2.67 ± 0.03     6.37 ± 0.21
KNeighboursRegressor                4.77 ± 0.13     2.67 ± 0.03     6.37 ± 0.21
RANSACRegressor                     4.93 ± 0.20     1.64 ± 0.09     6.09 ± 0.26
HuberRegressor                      4.44 ± 0.18     0.67 ± 0.03     5.79 ± 0.25
ElasticNet                          4.65 ± 0.10     4.40 ± 0.04     7.91 ± 0.24
LinearRegression                    4.63 ± 0.19     0.73 ± 0.04     5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                    ε_ITE           ε_ATE           √ε_PEHE
Support Vector Regressor (SVR)      2.44 ± 0.08     0.45 ± 0.03     2.81 ± 0.13
BayesianRidge                       4.55 ± 0.19     0.95 ± 0.05     5.78 ± 0.26
LassoLars                           4.66 ± 0.11     4.41 ± 0.05     7.90 ± 0.24
Lasso                               4.66 ± 0.11     4.41 ± 0.05     7.90 ± 0.24
ARDRegression                       4.58 ± 0.19     0.96 ± 0.05     5.78 ± 0.26
PassiveAggressiveRegressor          5.44 ± 0.22     1.18 ± 0.07     5.87 ± 0.26
TheilSenRegressor                   4.68 ± 0.19     0.95 ± 0.05     5.78 ± 0.26
BaggingRegressor                    4.46 ± 0.13     2.33 ± 0.04     6.12 ± 0.22
KNeighboursRegressor                4.46 ± 0.13     2.33 ± 0.04     6.12 ± 0.22
RANSACRegressor                     4.91 ± 0.20     1.73 ± 0.09     6.06 ± 0.27
HuberRegressor                      4.44 ± 0.18     0.92 ± 0.05     5.77 ± 0.26
ElasticNet                          4.66 ± 0.11     4.41 ± 0.05     7.90 ± 0.24
LinearRegression                    4.61 ± 0.19     0.94 ± 0.05     5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is far below the regressors', the main reason being that the encoding of the target values into class probabilities does not match the continuous values that need to be predicted; precision is also lost when decoding the predictions. The ℓ2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

                                       ε_ITE           ε_ATE           √ε_PEHE
LogisticRegression - L2 (newton-cg)    7.77 ± 0.76     4.40 ± 0.17     7.77 ± 0.76
LogisticRegression - L2 (lbfgs)        7.77 ± 0.76     4.40 ± 0.17     7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                       ε_ITE           ε_ATE           √ε_PEHE
LogisticRegression - L2 (newton-cg)    5.90 ± 0.57     2.41 ± 0.11     7.21 ± 0.85
LogisticRegression - L2 (lbfgs)        5.90 ± 0.57     2.41 ± 0.11     7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were performed. The observed errors were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel, C = 1e3 and gamma = 0.01. The selection was performed within-sample and out-of-sample on 100 replications of the dataset; this is the same method that the authors (Shalit, Johansson and Sontag 2017; Johansson, Shalit and Sontag 2016a; Louizos et al. 2017) state they used


for their own hyper-parameter selection. Tables 4.13 and 4.14 show the results of the SVR hyper-parameter tuning runs.
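The kind of sweep behind these tables can be reproduced with scikit-learn's GridSearchCV; the toy data and the grid below are illustrative only, not the thesis's exact setup:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.randn(200)

# Sweep C and gamma for the rbf kernel, scored by cross-validated R^2
grid = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1e2, 1e3], "gamma": [1.0, 0.1, 0.01]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Selecting hyper-parameters on a held-out split (here via cross-validation) avoids tuning on the evaluation replications themselves.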

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

                           ε_ITE           ε_ATE           √ε_PEHE
SVR-rbf-1e3-g01            3.17 ± 0.40     0.82 ± 0.09     3.30 ± 0.42
SVR-rbf-1e3-g005           2.71 ± 0.34     0.45 ± 0.07     2.78 ± 0.36
SVR-rbf-1e3-g001           2.35 ± 0.29     0.24 ± 0.03     2.32 ± 0.31
SVR-rbf-1e3-g0001          3.65 ± 0.45     0.52 ± 0.09     4.51 ± 0.65
SVR-rbf-1e3-g00001         4.28 ± 0.55     0.76 ± 0.11     5.61 ± 0.82
SVR-rbf-1e3-g000001        4.25 ± 0.52     1.49 ± 0.10     5.97 ± 0.81
SVR-rbf-1e10-g01           3.17 ± 0.40     0.82 ± 0.09     3.30 ± 0.42
SVR-rbf-1e20-g01           3.17 ± 0.40     0.82 ± 0.09     3.30 ± 0.42
SVR-rbf-1e30-g01           3.17 ± 0.40     0.82 ± 0.09     3.30 ± 0.42
SVR-poly-1e3-degree2       2.50 ± 0.29     0.28 ± 0.03     2.53 ± 0.30
SVR-poly-1e3-degree1       2.50 ± 0.29     0.28 ± 0.03     2.53 ± 0.30
SVR-poly-1e3-degree4       2.50 ± 0.29     0.28 ± 0.03     2.53 ± 0.30
SVR-poly-1e10-degree2      2.99 ± 0.34     0.41 ± 0.06     3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

                           ε_ITE           ε_ATE           √ε_PEHE
SVR-rbf-1e3-g01            2.79 ± 0.27     0.86 ± 0.13     3.25 ± 0.42
SVR-rbf-1e3-g005           2.66 ± 0.24     0.53 ± 0.10     2.71 ± 0.35
SVR-rbf-1e3-g001           2.50 ± 0.23     0.31 ± 0.05     2.26 ± 0.31
SVR-rbf-1e3-g0001          3.45 ± 0.40     0.77 ± 0.16     4.23 ± 0.62
SVR-rbf-1e3-g00001         4.09 ± 0.50     0.96 ± 0.21     5.31 ± 0.77
SVR-rbf-1e3-g000001        4.05 ± 0.47     1.59 ± 0.18     5.65 ± 0.75
SVR-rbf-1e10-g01           2.79 ± 0.27     0.86 ± 0.13     3.25 ± 0.42
SVR-rbf-1e20-g01           2.79 ± 0.27     0.86 ± 0.13     3.25 ± 0.42
SVR-rbf-1e30-g01           2.79 ± 0.27     0.86 ± 0.13     3.25 ± 0.42
SVR-poly-1e3-degree2       2.87 ± 0.22     0.38 ± 0.05     2.48 ± 0.29
SVR-poly-1e3-degree1       2.87 ± 0.22     0.38 ± 0.05     2.48 ± 0.29
SVR-poly-1e3-degree4       2.87 ± 0.22     0.38 ± 0.05     2.48 ± 0.29
SVR-poly-1e10-degree2      3.21 ± 0.33     0.48 ± 0.06     2.95 ± 0.39

Finally, the results obtained by this thesis's experiments are displayed in Tables 4.9 and 4.10, whereas Tables 4.15 and 4.16 show the results obtained in the publication (Shalit, Johansson and Sontag 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method       √εPEHE       εATE
OLS/LR-1     5.8 ± .3     .73 ± .04
OLS/LR-2     2.4 ± .1     .14 ± .01
BLR          5.8 ± .3     .72 ± .04
k-NN         2.1 ± .1     .14 ± .01
TMLE         5.0 ± .2     .30 ± .01
BART         2.1 ± .1     .23 ± .01
RAND FOR     4.2 ± .2     .73 ± .05
CAUS FOR     3.8 ± .2     .18 ± .01
BNN          2.2 ± .1     .37 ± .03
TARNET       .88 ± .0     .26 ± .01
CFR MMD      .73 ± .0     .30 ± .01
CFR WASS     .71 ± .0     .25 ± .01

Within sample, IHDP, 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method       √εPEHE       εATE
OLS/LR-1     5.8 ± .3     .94 ± .06
OLS/LR-2     2.5 ± .1     .31 ± .02
BLR          5.8 ± .3     .93 ± .05
k-NN         4.1 ± .2     .79 ± .05
BART         2.3 ± .1     .34 ± .02
RAND FOR     6.6 ± .3     .96 ± .06
CAUS FOR     3.8 ± .2     .40 ± .03
BNN          2.1 ± .1     .42 ± .03
TARNET       .95 ± .0     .28 ± .01
CFR MMD      .78 ± .0     .31 ± .01
CFR WASS     .76 ± .0     .27 ± .01

Out-of-sample, IHDP, 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Although the Related Work section discussed publications implementing powerful feature-selection methods, an experiment was nevertheless performed in the developed code that applies Recursive Feature Elimination (RFE) using the scikit-learn framework.

In addition, under the Strong Ignorability assumption made for the studied dataset it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; thus they are not shown, but the reader can review them in the code implementation for further analysis.
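For reference, an RFE experiment of this kind looks as follows with scikit-learn. This is an illustrative sketch, not the exact thesis configuration: the estimator, the number of retained features, and the stand-in data are all assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 25))                    # stand-in for the IHDP covariates
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)  # only feature 0 is informative

# Recursively drop the weakest features (smallest coefficients) until 10 remain.
selector = RFE(LinearRegression(), n_features_to_select=10, step=1)
selector.fit(X, y)

X_reduced = selector.transform(X)  # the causal regressors are then retrained on this
print(X_reduced.shape)             # (100, 10)
```

`selector.support_` marks which columns survived; the informative feature is kept while most noise columns are eliminated.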

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel Dual Core i7.

Domain Adaptation algorithms are a promising avenue for exploring ITE and ATE prediction due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                  εITE          εATE          √εPEHE
DANN (Within-sample)    1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and Out-of-sample, IHDP, 10 replications

4.3 Discussion

As can be clearly noticed, the machine learning regressors applied in this dissertation come very close to the results obtained in the work published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treated dataset, and no other custom loss function were applied to obtain the results shown in Table 4.10 and Table 4.9.


The compared authors appear to expend an excessive amount of effort on complicated methods without gaining much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the training and testing errors over the 10 replications of the Domain Adaptation Neural Networks showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying machine learning regressors out-of-the-box, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate, custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is part of the must-do tasks when using machine learning algorithms) have been performed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, but it shifted when the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the deep learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal problems on observational data involving more than two possible outcomes that are substantially better suited to Reinforcement Learning algorithms than to any deep neural network or regressor.

Finally, this work is intended to cover a considerable gap: straightforward definitions for applying machine learning to causality. Although several noticeable papers have been published in the last two years, they are difficult to follow for researchers with a computer or data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions for this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, extending the functionality developed for a binary treatment and a continuous outcome to multi-valued treatments. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset supporting this kind of treatment would need to be processed.

Third, applying this method to binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due to both the architectural design of the algorithm and the precision, outperforming the state of the art, that it showed in the experiment run.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the applied treatments; such settings are framed as time-series problems in continuous space. These kinds of datasets will possibly be the next focus of researchers applying machine learning to treatments applied over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. "Causal Reasoning from Longitudinal Data". DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.nature.com/articles/nmeth0410-247

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373. URL: http://www.ncbi.nlm.nih.gov/pubmed/23025434

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. "Causal Inference Using Potential Outcomes: Design, Modeling, Decisions". DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880. URL: https://doi.org/10.1198/016214504000001880

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. A: Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (Complete Draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018)

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.



List of Tables

4.1  IHDP 10 replications with traditional machine learning algorithms - Within sample  20
4.2  IHDP 10 replications with traditional machine learning algorithms - Out-of-sample  21
4.3  IHDP 100 replications - Within sample  21
4.4  IHDP 100 replications - Out-of-sample  21
4.5  IHDP 100 replications already split dataset - Within sample  22
4.6  IHDP 100 replications already split dataset - Out-of-sample  22
4.7  IHDP 100 replications - No scaling - Within sample  23
4.8  IHDP 1000 replications - No Scaling - Out-of-sample  23
4.9  IHDP 100 replications - Scaled - Within sample  23
4.10 IHDP 1000 replications - No Scaling - Out-of-sample  24
4.11 IHDP 100 replications logistic regressions - Within sample  24
4.12 IHDP 100 replications logistic regressions - Out-of-sample  24
4.13 IHDP 100 replications SVR hyper-parameter tuning - Within sample  25
4.14 IHDP 100 replications SVR hyper-parameter tuning - Out-of-sample  25
4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)  26
4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)  26
4.17 Domain Adaptation Neural Networks  27


List of Abbreviations

ML    Machine Learning
SVR   Support Vector Regressor
RL    Reinforcement Learning
NN    Neural Networks
LR    Linear Regression
KNN   K Nearest Neighbours
RCE   Randomized Controlled Experiment
ITE   Individual Treatment Effect
ATE   Average Treatment Effect
PEHE  Precision in Estimation of Heterogeneous Effects
CATE  Conditional Average Treatment Effect
RCM   Rubin Causal Model


Chapter 1

Introduction

1.1 Motivation

Causality is often confused with correlation, yet correlation does not imply causation. Such inferences are often called spurious correlations, and they frequently confuse the process by which humans make decisions.

Even today, the scientific community has not agreed on a common definition of causality.

The proven scientific way to make claims about cause and effect is to perform what is called a Randomized Controlled Trial (RCT). In an RCT, a statistically representative portion of the population participating in the experiment (trial) is exposed to a treatment (action), which can be either positive - applying the treatment - or neutral (control) - giving the patient a placebo or not treating the patient (unit) at all.

All these concepts are described with medical vocabulary, since the field in which RCTs are applied the most is medical trials. However, it is not the only industry in which conclusions can be drawn from a trial. For example, RCTs are widely used in social studies, but they can also inform decisions on buying, selling or holding a particular stock, or on displaying the advertisement that generates more sales than the others in the advertising industry.

Although Randomized Controlled Trials are the best way to detect causal effects, performing such scientifically rigorous experiments is most of the time either impossible or unethical. An example can be seen when trying to detect whether driving under the effects of alcohol affects the driver's skills. Another clear example is determining the causes of smoking in teenagers or young people, for which an RCT would involve taking two groups of non-smoking teenagers, making half of the units smoke for several years - depending on the experiment or the research question - and then determining what smoking caused in the young people after that period. As the reader can infer, there are clear ethical problems associated with performing full RCTs to determine causes and effects.

It is also important to make clear that there are cases in which previously collected data cannot be analyzed with the RCT method in mind. These cases are known as observational data. Observational data is defined as information obtained from previously collected situations in which a formal


randomized trial method has not been applied, but from which it is still important to try to determine causes and effects. This is the case in most organizations today: they have possibly been collecting massive amounts of data during the last decades, but they could not, or would not, establish an RCT process while collecting the metrics. Moreover, sometimes the data collection process happens in a non-randomized, uncontrolled experiment. For example, a not-so-common disease that affects just a small percentage of the population might happen to appear within a wide range of people, which makes the inference process difficult.

In the past, causal inference methods have been a statisticians-only field. However, with recent advances in machine learning algorithms, more computer scientists and machine learning engineers have been trying to infer causes and relationships through traditional and new machine learning techniques.

The ability of some machine learning algorithms to learn complex non-linear relationships has been used to detect and predict policies in which, given the particular features of an individual (patient), the algorithm determines whether or not to apply the treatment (action). This concept is known as Individual Treatment Effect estimation; it can also be referred to as Policy Risk when predicting a binary class - to apply or not the treatment - through the discovery of a certain threshold.

It is a matter of interest to be able to predict the individual (customized) treatment effect, because this would lead to better decisions (actions or treatments) specifically shaped for each person, not relying only on the average of the whole studied population.

The ultimate motivation of this work is to predict the Individual Treatment Effects for new patients from previously collected data, using machine learning techniques alternative to the ones used in past research efforts. Moreover, a compilation of the literature that can be understood by more computer scientists is attempted, and running code for all the performed experiments is released for others to build upon.

1.2 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE) for a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program), which is a semi-synthetic dataset, particularly unbalanced, created for the task of causal inference on observational data (Hill, 2011).

The research experiments explore whether alternative machine learning methods, without adding extra complexity, custom error losses or custom metric functions while learning and predicting, can obtain similar or even better results than the state-of-the-art metrics on the exact same benchmark dataset.


1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes framework (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation will be used throughout this thesis. This model is also known as the Neyman-Rubin causal model, and it is an approach to statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn Python framework has been used (User guide: contents - scikit-learn 0.19.2 documentation). All its underlying methods and default hyper-parameters have been used. Also, the mathematical notation of its documentation will be presented to describe each algorithm's functions and limitations.

The different algorithms have been tested on a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill, 2011) as a semi-synthetic dataset based on real features obtained from an observational study (Gross, 1993). Replications of this dataset have been created for 10, 100 and 1000 cases, in order to train the machine learning models, predict on them, and obtain the desired metric error results afterwards.

It is important to notice that the testing method and the metrics used to determine the effectiveness of each algorithm differ from the ones normally used to test machine learning algorithms; testing these algorithms within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset has been used, the real Individual Treatment Effect is available for computing testing metrics. Therefore, the experiment results will consist of the performance of each algorithm based on the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE). These three are the metrics displayed in the Experiments chapter for each trained algorithm. A detailed explanation of these formulas can be found in Section 2.0.3.
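For orientation, a common convention for the two aggregate metrics (the one used in the IHDP literature this thesis compares against; here $\tau(x_i) = Y_1(x_i) - Y_0(x_i)$ is the true and $\hat{\tau}(x_i)$ the predicted individual effect over $n$ units) is:

```latex
\epsilon_{ATE} = \left| \frac{1}{n}\sum_{i=1}^{n}\hat{\tau}(x_i) - \frac{1}{n}\sum_{i=1}^{n}\tau(x_i) \right|,
\qquad
\epsilon_{PEHE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{\tau}(x_i) - \tau(x_i)\bigr)^2
```

The tables report $\sqrt{\epsilon_{PEHE}}$; the ITE error is taken per unit (for example, the mean absolute deviation of $\hat{\tau}$ from $\tau$).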

Also, it is important to notice that the machine learning algorithms are trained using just the applied (observed) treatment, the features (covariates in the causal inference literature) and the observed outcome (usually known as Y or Y factual). After training, a dataset completely unseen during training is used for testing purposes.

The already-trained algorithm predicts the outcome based on the unit's (also known as patient's) features (covariates) as if the unit had taken the treatment, and likewise predicts the outcome as if the unit had taken the control treatment. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE and PEHE metrics are calculated. In addition, an average score and its deviation over each run of the 10, 100 and 1000 replications of IHDP have been computed to evaluate these errors in bigger simulated scenarios.
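The evaluation step described above can be sketched concretely. This is a simplified sketch under the assumption that arrays of true and predicted potential outcomes are available per replication; the metric conventions are the common ones, not necessarily the exact thesis code.

```python
import numpy as np

def causal_errors(y0_true, y1_true, y0_pred, y1_pred):
    """Errors between true and predicted individual effects tau = Y1 - Y0."""
    tau_true = y1_true - y0_true
    tau_pred = y1_pred - y0_pred
    eps_ite = np.mean(np.abs(tau_pred - tau_true))           # one common ITE-error convention
    eps_ate = np.abs(np.mean(tau_pred) - np.mean(tau_true))  # error on the average effect
    sqrt_pehe = np.sqrt(np.mean((tau_pred - tau_true) ** 2)) # the tables report this root
    return eps_ite, eps_ate, sqrt_pehe

# Toy example: the regressor predicted both potential outcomes for 3 units.
y0_true = np.array([1.0, 2.0, 3.0])
y1_true = np.array([2.0, 4.0, 3.5])
y0_pred = np.array([1.1, 2.1, 2.9])
y1_pred = np.array([2.2, 3.8, 3.6])
ite, ate, pehe = causal_errors(y0_true, y1_true, y0_pred, y1_pred)
```

Averaging these three numbers over the 10, 100 or 1000 replications, together with their deviation, yields the rows of the result tables.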

The mathematical notation will be kept to a minimum so as not to confuse the reader with unnecessary information.


1.4 Scope and Limitation

In this document, the outcomes of a treatment applied to a patient will be analyzed with respect to only two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments nor in the developed code, but the code can easily be extended to cover these cases.

A binary treatment is applied, but its outcome value is continuous, in contrast to the most commonly used case of four possible scenarios in which usually just two can be observed or measured. All the experiments and the developed code can be applied to discrete outputs, but other machine learning techniques (classification algorithms) could be more suitable for this type of prediction. Also, there are cases in which the task is to predict whether or not to apply the treatment to an individual (also known as unit or patient). Prediction then turns into a classification task, in which a threshold on the confidence of affirmatively applying the treatment is set and validated, through trial and error against several continuous values, to determine the one that predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

This case is closer to real-world scenarios, where the data was observed and a decision on whether or not to apply the treatment (action) has to be made in order to pursue a desired result.

In this work, datasets whose outcomes are recorded in binary form, for predicting whether or not to apply the treatment, will not be covered.


Chapter 2

Background

2.0.1 Rubin–Neyman Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Rubin–Neyman potential outcomes framework, is a statistical analysis framework for modeling observational data developed by Donald Rubin. He built it on top of the method that Jerzy Neyman originally introduced in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Rubin–Neyman potential outcomes framework considers units

x_i ∈ X

with an effectively applied treatment

t_i ∈ {0, 1}

The two possible potential outcomes are defined by

Y_0(x_i), Y_1(x_i) ∈ Y

For the one which actually happened, we can observe its factual outcome:

y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i)

Let (x_1, t_1, y_1^F), …, (x_n, t_n, y_n^F) be a sample from the factual distribution, and let (x_1, 1 − t_1, y_1^CF), …, (x_n, 1 − t_n, y_n^CF) be the counterfactual sample.

Notice that all the factual outcomes y^F are known, whereas the counterfactual outcome y^CF is never known for any unit (except during the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions y^F and y_ft will be used interchangeably to refer to factual (observed) outcomes, while y^CF and y_cft will denote counterfactual outcomes.
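The factual/counterfactual bookkeeping above can be sketched in a few lines of NumPy. This is a toy illustration with made-up outcomes, not the thesis code:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
t = rng.integers(0, 2, size=n)        # effectively applied treatment t_i in {0, 1}
y0 = rng.normal(1.0, 0.5, size=n)     # potential outcome under control, Y_0(x_i)
y1 = y0 + 2.0                         # potential outcome under treatment, Y_1(x_i)

# Factual outcome: the one produced by the treatment actually applied
y_f = t * y1 + (1 - t) * y0
# Counterfactual outcome: the one under the opposite treatment
y_cf = (1 - t) * y1 + t * y0
```

In real observational data only `y_f` is ever recorded; `y_cf` is available here only because the outcomes are simulated.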


2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x and assigning it either treatment t = 1 or t = 0, it is impossible to observe the counterfactual outcome E[Y_0 | x, t = 1] or E[Y_1 | x, t = 0] (what would have happened, or what the outcome would have been, if the other treatment had been given to unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y_0 | x, t = 0] or E[Y_1 | x, t = 1], or in shorter terms Y_0 or Y_1.

In this dissertation the focus is on the case where the causal graph is simple and known to be of the form (Y_1, Y_0) ← x → t, with no hidden confounders.

This problem, like most of the problems and applications discussed under the Rubin–Neyman potential outcomes framework in this dissertation, can be extended to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome y_cf is even worse in multi-treatment experiments, since the number of missing values that matter for better Individual Treatment Effect estimates grows with the number of possible treatments, minus the one applied.

2.0.3 Metrics for Causality

Three well-known metrics in the causality field are reported for each machine learning technique applied.

The losses that will be reported are

• εITE: Error of the Individual Treatment Effect, also known as the Conditional Average Treatment Effect (CATE); it measures how well or badly the treatment performs on one particular patient.

ITE(x) := E[y | X = x, t = 1] − E[y | X = x, t = 0] = E[Y_x1 − Y_x0]

• εATE: Error of the Average Treatment Effect; as its name indicates, it represents the effect that the applied treatment (either t = 0 or t = 1) effectively had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, since the patient's unique characteristics as a unit might lead to wrong results or no results at all.

ATE := E[ITE(x)] = E[δ] = E[Y_1 − Y_0], ∀x ∈ X

• εPEHE: Precision in Estimation of Heterogeneous Effect, used to measure the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or not as accurate, for the other.

PEHE := (1/N) ∑_{i=1}^{N} ((y_i1 − y_i0) − (ŷ_i1 − ŷ_i0))^2
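As an illustration, εATE and √εPEHE can be computed as follows. This is a sketch that assumes both true and predicted potential outcomes are available, which only holds for semi-synthetic data such as IHDP:

```python
import numpy as np

def causal_metrics(y1_true, y0_true, y1_pred, y0_pred):
    """eps_ATE and sqrt(eps_PEHE) from true and predicted potential outcomes."""
    ite_true = y1_true - y0_true
    ite_pred = y1_pred - y0_pred
    eps_ate = abs(ite_pred.mean() - ite_true.mean())
    pehe = np.mean((ite_true - ite_pred) ** 2)   # PEHE as defined above
    return eps_ate, np.sqrt(pehe)                # the square root is what tables report

# Perfect predictions give zero error on both metrics
y1 = np.array([3.0, 5.0, 4.0])
y0 = np.array([1.0, 2.0, 1.5])
eps_ate, rpehe = causal_metrics(y1, y0, y1, y0)
```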


2.0.4 Assumptions

To work on the results, three important assumptions are made under the Rubin–Neyman causal framework:

• Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y_0 will be the observed (factual) outcome y^F; if the applied treatment was t = 1, then y = Y_1 will be the available observed (factual) outcome y^F.

• Strong Ignorability: also known as no unmeasured confounders, this assumption can be stated as (Y_1, Y_0) ⫫ t | x and 0 < p(t = 1 | x) < 1, ∀x. It is important to notice that, to assert this assumption, a domain expert would have to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: this assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 | x) < 1
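One practical way to probe this assumption is to estimate the propensity score p(t = 1 | x) and check that no unit's estimate collapses towards 0 or 1. The snippet below is an illustrative diagnostic on simulated data, not part of the thesis pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 5))                          # covariates
t = (x[:, 0] + rng.normal(size=500) > 0).astype(int)   # treatment depends on x

# Estimated propensity score p(t = 1 | x); estimates collapsing towards
# 0 or 1 flag regions where common support is doubtful
prop = LogisticRegression().fit(x, t).predict_proba(x)[:, 1]
print(prop.min(), prop.max())
```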

2.0.5 Definitions

In causal inference from observational data, several terms are used interchangeably and might confuse the reader.

The terms in this subsection should be clear before going further into this dissertation.

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input. Notation: x_i, x_i ∈ X.

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML). Notation: x, x ∈ X.

• treatment: the possible actions that can be applied to a unit. Usually binary, but can be multi-valued under the Rubin–Neyman potential outcomes framework. Synonym: action. Notation: t, t ∈ {0, 1} or t ∈ {0, …, N}.

• outcome: the measured result of applying a treatment t to a unit x. Synonyms: observed outcome, result, factual, Y factual. Notation: y_f = y^F.

• counterfactual: what the result would have been if the opposite of the effectively applied treatment had been applied to a unit. Synonym: unobserved outcome. Notation: y_cft, y_cf, Y^CF.

2.0.6 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin, 1978).


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning it has drawn attention no less than a decade ago (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear Causal Discovery with Additive Noise Models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference for counterfactual prediction is usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods model the relationships between features, action pairs and rewards by fitting one or more parameters, trying to explicitly model the relations between context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015) and regression trees (Chipman, George and McCulloch, 2010) have been used in the past. For example, (Wager and Athey, 2017) estimate ITEs with causal forests, but their asymptotic estimates have limitations in datasets with a large number of relevant features that need to be addressed in future work.

In non-parametric approaches, counterfactual predictions are mostly obtained through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causal inference is performed by merging parametric and non-parametric methods (Dudik, Langford and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both families; a common example is propensity score weighted regression (Bang and Robins, 2005; Dudik, Langford and Li, 2011). When the treatment assignment probability is known, these methods model the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational data settings their efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting individual treatment effects has attracted a lot of interest during the last two years, through the development of custom metric functions, as well as the application of other techniques to causality, with special focus on datasets with unbalanced treatment assignment. This is the sub-area known as causal inference from observational data: observational data is data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to neural networks has brought interesting advances to the scientific community (Shalit, Johansson and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag, 2016a; Shalit, Johansson and Sontag, 2017; Alaa, Weisz and Van Der Schaar, 2017) have made important contributions, whereas for policy optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In policy optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few, (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016) and (Shalit, Johansson and Sontag, 2017). (Johansson, Shalit and Sontag, 2016a) and (Shalit, Johansson and Sontag, 2017) worked on learning balanced representations, using neural networks both to learn better predictions of the factual outcome and to minimize the error between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag, 2017) the authors built on (Johansson, Shalit and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. For ITE prediction, other work was done with Gaussian processes (Alaa, Weisz and Van Der Schaar, 2017) and with decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with many features, performing feature selection while predicting among multiple possible actions (outcomes); this is more challenging, but models actual industry problems more closely. The authors also remark on the difficulty of learning which features are relevant for predicting some actions while not taking them into account for others. The relevant feature selection was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), as in (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and did not identify individually important features for the task.

In terms of policy optimization methods, (Swaminathan and Joachims, 2015a) proposed a Counterfactual Risk Minimization (CRM) method, which minimizes the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. Later, (Atan, Zame and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald and Pereira, 2006). Additional techniques for policy optimization were developed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work in the DA field was done by (Zhang et al., 2013; Daumé, 2009).


To conclude, in cause and effect analysis, time series data is widely adopted for decision making support. The main challenge in the continuous-time setting is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator based on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time series causality task. Later, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning, due to the nature of the observational data. Retrospective observational data is used in off-policy learning to estimate the best expected reward of a policy that is set beforehand (Dudik, Langford and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas and Brunskill, 2017).

2.1 Machine Learning

This section describes the machine learning techniques applied in the experiments, using the scikit-learn open-source framework.

The vast majority of the methods tested belong to the family of generalized linear models, in which the target (label) value is represented as a linear combination of the covariates (inputs):

y(w, x) = w_0 + w_1 x_1 + … + w_p x_p    (2.1)

where the vector w = (w_1, …, w_p) represents the coefficients and w_0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it.

Mathematically it solves the problem of

min_w ||Xw − y||_2^2

The main limitation of this method is that, if the features (covariates) are approximately linearly dependent, the model produces high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. The loss turns into a penalized sum-of-squares minimization problem:

min_w ||Xw − y||_2^2 + α||w||_2^2

It is worth mentioning that the parameter α ≥ 0 controls the amount of robustness to collinearity that the trained model will have.

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification to regression problems. During the training phase, the best solution is the one least penalized in total by the loss function. The support vectors are the inputs that are either misclassified, classified within the margin, or lying on the edge of the generated hyperplane that splits the dataset for future predictions.

In particular, a SVR takes training vectors x_i ∈ R^p, i = 1, …, n, and a vector y ∈ R^n; ε-SVR solves the following primal problem:

min_{w, b, ζ, ζ*}  (1/2) w^T w + C ∑_{i=1}^{n} (ζ_i + ζ*_i)

subject to  y_i − w^T φ(x_i) − b ≤ ε + ζ_i,
            w^T φ(x_i) + b − y_i ≤ ε + ζ*_i,
            ζ_i, ζ*_i ≥ 0, i = 1, …, n

where e is the vector of all ones, C > 0 is the upper bound, and Q is an n × n positive semidefinite matrix with Q_ij ≡ K(x_i, x_j) = φ(x_i)^T φ(x_j) the kernel. Here training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is:

∑_{i=1}^{n} (α_i − α*_i) K(x_i, x) + ρ

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than Linear Regression.

This technique builds a probabilistic model of the regression problem, in which the prior over the parameter w of the general Bayesian regression solver is a spherical Gaussian:

p(w | λ) = N(w | 0, λ^{-1} I_p)

The scikit-learn defaults are used to train the model: α_1 = α_2 = λ_1 = λ_2 = 10^{-6}.

During model fitting, the parameters w, α and λ are estimated jointly.


2.1.5 Lasso

Lasso regression is a linear model fitted with an l1 prior as regularizer. Its objective is to minimize:

min_w  (1 / (2 n_samples)) ||Xw − y||_2^2 + α||w||_1

This method minimizes the least-squares penalty with α||w||_1 added, where α is a constant and ||w||_1 is the l1-norm of the parameter vector.

It is important to notice that this algorithm yields sparse models, which may be helpful for performing feature selection.
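The sparsity can be seen on toy data (a sketch with made-up coefficients, not the IHDP covariates): when only a few covariates drive the outcome, the l1 penalty zeroes out the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two covariates actually drive the outcome
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = Lasso(alpha=0.1).fit(X, y)
# The l1 penalty drives the irrelevant coefficients to exactly zero
print(np.flatnonzero(model.coef_))
```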

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with l1 regularization applied.

The objective function is determined by

(1 / (2 n_samples)) ||y − Xw||_2^2 + α||w||_1

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it can lead to sparser weights w.

It drops the assumption that the Gaussian is spherical, allowing it to be elliptical.

Mathematically:

p(w | λ) = N(w | 0, A^{-1})

with diag(A) = λ = {λ_1, …, λ_p}.

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, this method does not require a learning rate, but it does require a regularization parameter C.

It can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (squared epsilon-insensitive).

2.1.9 Theil Sen Regressor

It is especially suited to multivariate outliers, but its efficiency decreases dramatically when it tries to solve a high-dimensional problem. In that regime this method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the n nearest neighbors found during the training phase. It is important to notice that n is user-defined and affects, positively or negatively, the quality of the predictions.

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can produce predictions for more than one class using the logistic function.

The scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets; the results are discussed in the Experiments chapter.
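The regressors above can be fitted and compared in a single loop. The sketch below uses synthetic regression data (not IHDP) and the default hyperparameters, as in the experiments of this work:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import BayesianRidge, Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=25, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "BayesianRidge": BayesianRidge(),
    "Lasso": Lasso(),
    "SVR": SVR(),
    "KNeighborsRegressor": KNeighborsRegressor(),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)   # R^2 on held-out data
    print(f"{name}: R^2 = {scores[name]:.3f}")
```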


Chapter 3

Methodology

This chapter details what this work set out to achieve. The methods used are explained, together with any other information necessary to help the reader follow the experiments covered later.

In addition, the dataset that was used will be presented, closing with a section about other possible datasets and the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments will be covered in detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real lifescenarios are difficult to obtain

On the one hand, the whole point of the current project, and to some extent of recent efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients, or inputs (in machine learning vocabulary) that was collected without the chance of previously designing a carefully planned Randomized Controlled Trial. By its nature, the collected (observational) data has been neither properly randomized nor drawn from the same probability distribution. Also, the number of units which received the treatment and the number which did not could differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, would be completely unethical for clear reasons.

To overcome these limitations when working on causal effects in observational data, synthetic, semi-synthetic, or toy datasets are created by researchers in order to establish a good starting point and benchmark framework to try, test, or develop better algorithms that make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions in causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases a policy risk function π is designed, applying the treatment t or not depending on a certain threshold θ. Minimizing the errors when predicting the application of the active treatment or control is the main goal when iterating over different values of the threshold for the trained dataset.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held across multiple sites in the United States, applying control and treatment to reduce the developmental and health problems of premature infants with low birth weight. On the one hand, the treated group received home visits and attendance at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group received only the pediatric follow-up.

(Hill, 2011) presented a semi-synthetic (also called semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the paragraph above. In (Hill, 2011), some continuous and binary covariates from this real-life RCT were selected; using these covariates, the author created simulated, non-parametrically generated outcomes for the whole population of the trial. In total, 25 covariates of the whole study were taken for the dataset creation. The author then introduced an artificial imbalance between control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units or inputs), of which 608 did not receive the treatment (control) and 139 were treated. The dataset therefore ends up quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 and t = 1 based on the generalization that an algorithm can perform.

Along with the covariates, the simulated causal information can be observed for each unit: the effectively applied treatment (t = 0 or t = 1), the observed outcome (y_ft), the counterfactual outcome (y_cft), and the average outcomes with noise, mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.
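The replication files referenced above are NumPy `.npz` archives. The sketch below mimics their layout with random stand-in data; the field names (`x`, `t`, `yf`, `ycf`, `mu0`, `mu1`) are those used by the code accompanying the cited works and should be treated as assumptions:

```python
import io
import numpy as np

# Stand-in data mimicking the layout of the downloaded replication files:
# one array per field, with the last axis indexing the replication
n_units, n_cov, n_rep = 747, 25, 10
rng = np.random.default_rng(0)
buf = io.BytesIO()
np.savez(buf,
         x=rng.normal(size=(n_units, n_cov, n_rep)),    # covariates
         t=rng.integers(0, 2, size=(n_units, n_rep)),   # applied treatment
         yf=rng.normal(size=(n_units, n_rep)),          # factual outcome
         ycf=rng.normal(size=(n_units, n_rep)),         # counterfactual (evaluation only)
         mu0=rng.normal(size=(n_units, n_rep)),         # noiseless outcome, control
         mu1=rng.normal(size=(n_units, n_rep)))         # noiseless outcome, treated
buf.seek(0)

data = np.load(buf)
# Slice out replication 0 for training a single model
x_r, t_r, yf_r = data["x"][:, :, 0], data["t"][:, 0], data["yf"][:, 0]
print(x_r.shape, t_r.shape)
```

In practice the same slicing is applied per replication, training one model per slice and averaging the metrics.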

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE and PEHE errors, for the reader to consult if desired. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit and Sontag, 2016b), the IHDP dataset (Hill, 2011) was run on 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation, since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson and Sontag, 2017). In (Johansson, Shalit and Sontag, 2016b), the BART results were based on Bayesian Additive Regression Trees (Chipman, George and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill, 2011). However, the authors do not state the number of replications used to gather the metrics, nor whether a log-linear "A" or "B" (or any other) setting was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with the results of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset was performed to predict the factual (y^F) and counterfactual (y^CF) outcomes for every single unit. Those values are then fed into the code produced by (Louizos et al., 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which identifies the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome y^CF nor the noisy average outcomes mu0, mu1 can be used at all to train the regression models. Instead, these three values, along with the factual outcome y^F, are used only to compute the εITE, εATE and εPEHE errors.

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017), whose state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson and Sontag, 2017), the same methodology later followed by (Louizos et al., 2017) to perform, compare and show their results.

Within-sample: refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples for which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. These are common problems of observational data and were discussed earlier.


Out-of-sample: these predictions are made on completely unseen units, outside the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions different from those of the (already potentially unbalanced) training phase. The procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts the outcome for each one of the inputs (units) using treatment value t = 0, and then the predictions are repeated setting all treatment values to t = 1. The difference between these two predictions for each input is the ITE, and it ultimately defines whether the patient would benefit from the treatment. Mathematically, it is represented by E[Y_1 − Y_0 | x].

The machine learning algorithms implemented in Python via (User guide contents — scikit-learn 0.19.2 documentation) were run with the default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. Hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
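The prediction recipe just described can be sketched as follows. This is a toy illustration with simulated covariates and a made-up linear outcome (true effect 4.0), not the IHDP data; the trick of appending the treatment as an extra input column is the assumption being demonstrated:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 747
x = rng.normal(size=(n, 25))                              # covariates
t = rng.integers(0, 2, size=n)                            # applied treatment
y_f = x[:, 0] + 4.0 * t + rng.normal(scale=0.1, size=n)   # toy factual outcome

# Train on (covariates, applied treatment) -> factual outcome only
model = Ridge().fit(np.column_stack([x, t]), y_f)

# Predict both potential outcomes by forcing t = 0 and t = 1 for every unit
y0_hat = model.predict(np.column_stack([x, np.zeros(n)]))
y1_hat = model.predict(np.column_stack([x, np.ones(n)]))

ite_hat = y1_hat - y0_hat        # estimated E[Y_1 - Y_0 | x] per unit
ate_hat = ite_hat.mean()
```

With a linear model the estimated ITE is constant across units; non-linear regressors such as nearest neighbors produce heterogeneous per-unit estimates.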

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower, the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

Method                             εITE           εATE           √εPEHE
Support Vector Regressor (SVR)     2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                      3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                          4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                              4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                      3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor         4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                  3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                   5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor               5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                   3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, for within-sample and out-of-sample respectively. In this case it is remarkable that the split between training and testing was performed only over the training dataset, randomly. The intention was to test whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.26 ± 0.63   1.24 ± 0.62   2.34 ± 0.98
BayesianRidge                    3.54 ± 1.67   1.82 ± 1.35   4.13 ± 2.23
LassoLars                        4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
Lasso                            4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
ARDRegression                    3.57 ± 1.70   1.83 ± 1.41   4.14 ± 2.27
PassiveAggressiveRegressor       4.19 ± 1.94   2.38 ± 1.75   4.45 ± 2.49
TheilSenRegressor                3.62 ± 1.68   1.76 ± 1.30   4.08 ± 2.21
BaggingRegressor                 4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
KNeighborsRegressor              4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
LinearRegression                 3.58 ± 1.70   1.77 ± 1.33   4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
BayesianRidge                    4.49 ± 0.60   0.86 ± 0.16   5.65 ± 0.83
LassoLars                        4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
Lasso                            4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
ARDRegression                    4.49 ± 0.60   0.81 ± 0.16   5.64 ± 0.83
PassiveAggressiveRegressor       5.49 ± 0.75   0.83 ± 0.14   5.66 ± 0.83
TheilSenRegressor                4.45 ± 0.59   0.79 ± 0.15   5.63 ± 0.83
BaggingRegressor                 5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
KNeighborsRegressor              5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
LinearRegression                 4.53 ± 0.60   0.79 ± 0.16   5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
BayesianRidge                    4.27 ± 0.54   1.02 ± 0.26   5.37 ± 0.78
LassoLars                        4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
Lasso                            4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
ARDRegression                    4.27 ± 0.54   1.00 ± 0.26   5.36 ± 0.78
PassiveAggressiveRegressor       5.28 ± 0.69   1.00 ± 0.21   5.36 ± 0.77
TheilSenRegressor                4.24 ± 0.53   0.99 ± 0.25   5.35 ± 0.78
BaggingRegressor                 4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
KNeighborsRegressor              4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
LinearRegression                 4.31 ± 0.55   0.99 ± 0.26   5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, hyperparameter tuning, if any, was performed on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   3.05 ± 0.38   0.76 ± 0.08   3.17 ± 0.40
BayesianRidge                    4.44 ± 0.58   0.80 ± 0.14   5.61 ± 0.83
LassoLars                        4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
Lasso                            4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
ARDRegression                    4.45 ± 0.59   0.77 ± 0.15   5.61 ± 0.83
PassiveAggressiveRegressor       5.03 ± 0.62   0.83 ± 0.13   5.63 ± 0.82
TheilSenRegressor                4.40 ± 0.57   0.72 ± 0.13   5.60 ± 0.82
BaggingRegressor                 5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
KNeighborsRegressor              5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
LinearRegression                 4.48 ± 0.59   0.75 ± 0.14   5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.84 ± 0.28   0.73 ± 0.07   3.49 ± 0.49
BayesianRidge                    4.41 ± 0.57   0.81 ± 0.11   5.74 ± 0.89
LassoLars                        4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
Lasso                            4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
ARDRegression                    4.42 ± 0.58   0.78 ± 0.11   5.73 ± 0.89
PassiveAggressiveRegressor       4.95 ± 0.59   1.01 ± 0.17   5.78 ± 0.89
TheilSenRegressor                4.38 ± 0.56   0.85 ± 0.13   5.74 ± 0.89
BaggingRegressor                 4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
KNeighborsRegressor              4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
LinearRegression                 4.45 ± 0.58   0.78 ± 0.11   5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with log-linear response setting "A", generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four different tables are shown for the 1000 replications. In pairs: two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair (Tables 4.9 and 4.10) was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results of the methods presented in the following section.
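A minimal sketch of this scaling step, a NumPy equivalent of MinMaxScaler's fit-on-train / transform-both behaviour (the toy arrays are hypothetical):

```python
import numpy as np

# Min-max scaling to [0, 1]: the per-feature ranges are learned on the
# training covariates only, then applied unchanged to the test covariates.
x_train = np.array([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]])
x_test = np.array([[2.0, 40.0]])

lo, hi = x_train.min(axis=0), x_train.max(axis=0)
scale = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns

x_train_s = (x_train - lo) / scale        # spans exactly [0, 1]
x_test_s = (x_test - lo) / scale          # may fall outside [0, 1]
```

Fitting the ranges on the training split only avoids leaking test-set information into the preprocessing, mirroring what scikit-learn's `MinMaxScaler.fit`/`transform` pair does.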



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   3.09 ± 0.12   0.69 ± 0.02   3.21 ± 0.13
BayesianRidge                    4.59 ± 0.19   0.78 ± 0.04   5.81 ± 0.26
LassoLars                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                            4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                    4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor       5.41 ± 0.22   0.90 ± 0.05   5.85 ± 0.26
TheilSenRegressor                4.55 ± 0.18   0.70 ± 0.03   5.79 ± 0.26
BaggingRegressor                 5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
KNeighborsRegressor              5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
LinearRegression                 4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.81 ± 0.09   0.78 ± 0.03   3.37 ± 0.14
BayesianRidge                    4.57 ± 0.19   0.98 ± 0.05   5.79 ± 0.26
LassoLars                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                            4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                    4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor       5.42 ± 0.22   1.13 ± 0.07   5.83 ± 0.27
TheilSenRegressor                4.54 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                 4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
KNeighborsRegressor              4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
LinearRegression                 4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.38 ± 0.08   0.33 ± 0.02   2.77 ± 0.12
BayesianRidge                    4.58 ± 0.19   0.73 ± 0.04   5.80 ± 0.26
LassoLars                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                            4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                    4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor       5.47 ± 0.22   1.02 ± 0.06   5.88 ± 0.26
TheilSenRegressor                4.68 ± 0.19   0.69 ± 0.03   5.79 ± 0.26
BaggingRegressor                 4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
KNeighborsRegressor              4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
RANSACRegressor                  4.93 ± 0.20   1.64 ± 0.09   6.09 ± 0.26
HuberRegressor                   4.44 ± 0.18   0.67 ± 0.03   5.79 ± 0.25
ElasticNet                       4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
LinearRegression                 4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                    4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                            4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                    4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor       5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor                4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                 4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighborsRegressor              4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                  4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                   4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                       4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                 4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class setting was applied. Its performance is far below the regressors, the main reason being that encoding the continuous target values into classes does not preserve the exact values that need to be predicted, and further precision is lost when decoding the predictions. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
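The encode/decode round trip that causes this precision loss can be sketched as follows (the 0.5 bin width is an illustrative assumption, not the grid used in the code):

```python
import numpy as np

# Discretizing a continuous outcome into classes for a classifier,
# then decoding predicted classes back to values via bin centres.
y = np.array([2.31, 2.36, 4.92])
bins = np.arange(0.0, 6.0, 0.5)
classes = np.digitize(y, bins) - 1    # encode: continuous value -> class id
decoded = bins[classes] + 0.25        # decode: class id -> bin centre

# 2.31 and 2.36 fall into the same class, so they decode to the same
# value: distinct outcomes become indistinguishable to the classifier.
```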

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

Method                                εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)   7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)       7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                                εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)   5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)       5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


From all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were done. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were: Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample, but over 100 replications of the dataset; this is the same method that the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyper-parameter selection, including the final results.
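The selection loop can be sketched as below; kernel ridge regression with an RBF kernel stands in for the SVR (an assumption made to keep the example self-contained and dependency-free), while the gamma grid mirrors the one searched here:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(a, b, gamma):
    # Pairwise RBF kernel: exp(-gamma * ||a_i - b_j||^2)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_predict(x_tr, y_tr, x_te, gamma, alpha=1e-3):
    # Kernel ridge fit on the training split, prediction on held-out data.
    K = rbf_kernel(x_tr, x_tr, gamma)
    coef = np.linalg.solve(K + alpha * np.eye(len(x_tr)), y_tr)
    return rbf_kernel(x_te, x_tr, gamma) @ coef

# Toy regression problem (hypothetical data, for illustration only).
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x[:, 0]) + 0.05 * rng.normal(size=200)
x_tr, y_tr, x_va, y_va = x[:150], y[:150], x[150:], y[150:]

# Evaluate each candidate gamma on the validation split, keep the best.
errors = {}
for gamma in [0.1, 0.05, 0.01, 0.001, 0.0001]:
    pred = fit_predict(x_tr, y_tr, x_va, gamma)
    errors[gamma] = np.sqrt(np.mean((pred - y_va) ** 2))

best = min(errors, key=errors.get)
```

In the thesis the same pattern is applied with `sklearn.svm.SVR` over C, kernel and gamma, scoring each candidate across the 100 replications.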

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

Method                     εITE          εATE          √εPEHE
SVR rbf C=1e3 γ=0.1        3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR rbf C=1e3 γ=0.05       2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR rbf C=1e3 γ=0.01       2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR rbf C=1e3 γ=0.001      3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR rbf C=1e3 γ=0.0001     4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR rbf C=1e3 γ=0.00001    4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR rbf C=1e10 γ=0.1       3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR rbf C=1e20 γ=0.1       3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR rbf C=1e30 γ=0.1       3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR poly C=1e3 degree=2    2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR poly C=1e3 degree=1    2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR poly C=1e3 degree=4    2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR poly C=1e10 degree=2   2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

Method                     εITE          εATE          √εPEHE
SVR rbf C=1e3 γ=0.1        2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR rbf C=1e3 γ=0.05       2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR rbf C=1e3 γ=0.01       2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR rbf C=1e3 γ=0.001      3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR rbf C=1e3 γ=0.0001     4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR rbf C=1e3 γ=0.00001    4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR rbf C=1e10 γ=0.1       2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR rbf C=1e20 γ=0.1       2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR rbf C=1e30 γ=0.1       2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR poly C=1e3 degree=2    2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR poly C=1e3 degree=1    2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR poly C=1e3 degree=4    2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR poly C=1e10 degree=2   3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39

Finally, the results obtained by the experiments run in this thesis are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results reported in (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method      √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .73 ± .04
OLS/LR-2    2.4 ± .1    .14 ± .01
BLR         5.8 ± .3    .72 ± .04
k-NN        2.1 ± .1    .14 ± .01
TMLE        5.0 ± .2    .30 ± .01
BART        2.1 ± .1    .23 ± .01
RAND.FOR.   4.2 ± .2    .73 ± .05
CAUS.FOR.   3.8 ± .2    .18 ± .01
BNN         2.2 ± .1    .37 ± .03
TARNET      .88 ± .0    .26 ± .01
CFR MMD     .73 ± .0    .30 ± .01
CFR WASS    .71 ± .0    .25 ± .01

Within-sample, IHDP 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method      √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .94 ± .06
OLS/LR-2    2.5 ± .1    .31 ± .02
BLR         5.8 ± .3    .93 ± .05
k-NN        4.1 ± .2    .79 ± .05
BART        2.3 ± .1    .34 ± .02
RAND.FOR.   6.6 ± .3    .96 ± .06
CAUS.FOR.   3.8 ± .2    .40 ± .03
BNN         2.1 ± .1    .42 ± .03
TARNET      .95 ± .0    .28 ± .01
CFR MMD     .78 ± .0    .31 ± .01
CFR WASS    .76 ± .0    .27 ± .01

Out-of-sample, IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications implementing powerful feature selection methods were presented in the Related Work section, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) from the scikit-learn library.

Strictly speaking, assuming Strong Ignorability on the studied dataset, it would not be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, eliminating features might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; thus they are not shown here, but they can be reviewed in the code implementation for further analysis.
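For reference, the core of RFE can be sketched in a few lines (a toy linear-model version with hypothetical data; scikit-learn's RFE generalizes this to any estimator exposing coefficients or feature importances):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 6 covariates, of which only 3 actually drive the outcome.
n, d = 300, 6
x = rng.normal(size=(n, d))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0])
y = x @ true_w + 0.1 * rng.normal(size=n)

# Recursive elimination: repeatedly fit, then drop the feature with the
# smallest absolute coefficient, until the desired number remains.
keep = list(range(d))
while len(keep) > 3:
    w, *_ = np.linalg.lstsq(x[:, keep], y, rcond=None)
    keep.pop(int(np.argmin(np.abs(w))))
```

On this toy problem the three informative covariates survive; on the IHDP covariates, as noted above, the elimination hurt the causal error metrics.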

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network (DANN) implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results, shown below in Table 4.17, are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine. On CPU, the estimated finishing time was about four and a half days on an Intel dual-core i7.

Domain Adaptation algorithms are a promising approach for exploring ITE and ATE predictions, due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                  εITE          εATE          √εPEHE
DANN (Within-sample)    1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and Out-of-sample, IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the results of the machine learning regressors applied in this dissertation are very close to those obtained in the works published by the cited and compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the treatment imbalance of the dataset) nor any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.


The considerable effort invested by those authors seems to lead to complicated methods that do not gain much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Network on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state-of-the-art results.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is part of the standard tasks when using machine learning algorithms) were performed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes to later present the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, although it shifted when the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to notice that there are machine learning techniques introduced in recent years that are potentially more suitable than both machine learning regressors and custom or generalized metric and error functions, such as Domain Adaptation Neural Networks, as well as other methods from the Deep Learning literature. Moreover, there are continuous-space causal problems from observational data, with more than two possible outcomes, that are substantially more suitable to solve with Reinforcement Learning algorithms than with any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I did my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, extending the functionality developed for a binary treatment and a continuous output to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied or not, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing with this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the 1000-replication IHDP dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it showed in the experiment run here.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied is framed within time-series problems in the continuous space. These kinds of datasets will possibly be the next focus of researchers applying machine learning to treatments administered over time.


Bibliography

Alaa Ahmed M Michael Weisz and Mihaela Van Der Schaar (2017) Deep Counter-factual Networks with Propensity-Dropout Tech rep arXiv arXiv170605966v1URL httpsarxivorgpdf170605966pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf


Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie Vincent (2016) NPCI Non-parametrics for Causal Inference URL https githubcomvdorienpci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer Patrik O et al Nonlinear causal discovery with additive noise models Tech repURL https is tuebingen mpg de fileadmin user _ upload files publicationsNIPS2008-Hoyer-neu_5406[0]pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww


Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos Christos et al (2017) ldquoCausal Effect Inference with Deep Latent-VariableModels arXiv 1705 08821v2 [ stat ML ] 6 Nov 2017rdquo In Nips arXiv arXiv170508821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1


Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Complete Draft Tech rep URLhttpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: arXiv:1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep

• Declaration of Authorship
• Abstract
• Acknowledgements
• Introduction
  • Motivation
  • Purpose and Research Question
  • Approach and Methodology
  • Scope and Limitation
• Background
  • Rubin-Neyman Causal Model
    • The fundamental problem of causal analysis
    • Metrics for Causality
    • Assumptions
    • Definitions
    • Related Work
  • Machine Learning
    • Ordinary Least Squares (Linear Regression)
    • Ridge Regression
    • Support Vector Regressor
    • Bayesian Ridge
    • Lasso
    • Lasso Lars
    • ARD Regression
    • Passive Aggressive Regressor
    • Theil Sen Regressor
    • K-Neighbors Regressor
    • Logistic Regression
• Methodology
  • Dataset
  • IHDP dataset
  • Other articles metrics
• Experiments
  • Machine learning methods applied to IHDP dataset
  • Other experiments
    • Recursive Feature Elimination
    • Domain Adaptation Neural Networks
• Discussion
• Conclusions
  • Concluding Remarks
  • Future work
• Bibliography


List of Abbreviations

ML Machine Learning
SVR Support Vector Regressor
RL Reinforcement Learning
NN Neural Networks
LR Linear Regression
KNN K Nearest Neighbours
RCE Randomized Controlled Experiment
ITE Individual Treatment Effect
ATE Average Treatment Effect
PEHE Precision in Estimation of Heterogeneous Effects
CATE Conditional Average Treatment Effect
RCM Rubin Causal Model


Chapter 1

Introduction

1.1 Motivation

Causality is often confused with correlation, yet correlation does not imply causation. Such inferences are often called spurious correlations, and they frequently confuse the inference process by which humans make decisions.

A common definition of causality is still not agreed upon by the scientific community.

The proven scientific way to make claims about cause and effect is to perform what is called a Randomized Controlled Trial (RCT). In a Randomized Controlled Trial, a statistically representative portion of the population participating in the experiment (trial) is exposed to a treatment (action), which could be either positive (applying the treatment) or neutral (control), i.e. giving the patient a placebo or not treating the patient (unit) at all.

All these concepts are described with medical terms, since the field in which RCTs are applied the most is medical trials. However, it is not the only industry in which conclusions can be drawn from a trial. For example, RCTs are widely used in social studies, but they can also be applied to decisions on buying, selling or holding a particular stock, or to displaying the advertisement that generates more sales than the others in the advertising industry.

Although Randomized Controlled Trials are the best way to detect causal effects, performing such scientifically rigorous experiments is most of the time either impossible or unethical. An example can be seen when trying to detect whether driving under the effects of alcohol affects (or not) the driver's skills. Another clear example is determining the effects of smoking in teenagers or young people: performing an RCT would involve taking two groups of non-smoking teenagers, making half of the units smoke for several years (depending on the experiment or the research question), and then determining what smoking would cause in young people after that period. As the reader can infer, there are clear ethical problems associated with performing full RCTs to determine causes and effects.

It is also important to make clear that there are cases in which previously collected data has to be analyzed even though the RCT method was never applied to it. Such data is known as observational data. Observational data is defined as information obtained from previously collected situations in which a formal randomized trial method has not been applied, but from which it is still important to try to determine causes and effects. This is the case in most organizations at present: they have possibly been collecting massive amounts of data during the last decades, but could not, or did not, establish an RCT process while collecting the metrics. Moreover, sometimes the data collection process happens outside any randomized controlled experiment. For example, a not so common disease that affects just a small percentage of the population might happen to appear within a wide range of people, which makes the inference process difficult.

In the past, causal inference methods have been a statisticians-only field. However, with recent advances in machine learning algorithms, more computer scientists and machine learning engineers have been trying to infer causes and relationships through traditional and new machine learning techniques.

The ability of some machine learning algorithms to learn complex non-linear relationships has been used to detect and predict policies in which, given the particular features of an individual (patient), the algorithm determines whether or not to apply the treatment (action) to them. This concept is known as Individual Treatment Effect estimation; it can also be referred to as Policy Risk when predicting a binary class, that is, whether or not to apply the treatment through the discovery of a certain threshold.

Predicting the individual (customized) treatment effect is of interest because it would lead to better decisions (actions or treatments) specifically shaped for each person, instead of relying only on the average over the whole studied population.

The ultimate motivation of this work is to predict the Individual Treatment Effects for new patients from previously collected data, using machine learning techniques alternative to the ones used in past research efforts. Moreover, this work attempts to compile a literature review that can be understood by more computer scientists, and running code for all the experiments performed will be released for others to build upon.

1.2 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE) for a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program), a semi-synthetic, particularly unbalanced dataset created for the task of causal inference on observational data (Hill, 2011).

The research experiments investigate whether alternative machine learning methods, without adding extra complexity, custom error losses or custom metric functions while learning and predicting, are able to obtain similar or even better results than the state-of-the-art metrics based on the exact same benchmark dataset.


1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes Framework (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation will be used throughout this whole thesis. This model is also known as the Rubin-Neyman causal model, and it is an approach to the statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn (User guide: contents — scikit-learn 0.19.2 documentation) Python framework has been used. All its underlying methods and default hyper-parameters have been used. The mathematical notation of its documentation is also presented to describe each algorithm's functions and limitations.

The different algorithms have been tested against a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill, 2011) as a semi-synthetic dataset based on real features obtained from a real observational study (Gross, 1993). Replications of this dataset have been created with 10, 100 and 1000 cases, in order to train the machine learning models, predict with them, and obtain the desired metric error results afterwards.

It is important to notice that the testing method and the metrics used to determine the effectiveness of each algorithm differ from the ones normally used to test machine learning algorithms; testing these algorithms within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset has been used, the real Individual Treatment Effect is available for computing testing metrics. Therefore, the experiment results will consist of the performance of each algorithm based on the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE). These three metrics are displayed in the Experiments chapter for each algorithm trained. A detailed explanation of these formulas can be found in Section 2.0.3.

Also, it is important to notice that the machine learning algorithms are trained using just the treatment applied (observed), the features (covariates in the causal inference literature) and the observed outcome (usually known as Y or Y factual). After training, a dataset completely unseen during training is used for testing purposes.

The already trained algorithm predicts the outcome based on the unit's (also known as patient's) features (covariates) for the case in which the unit would have taken the treatment, and likewise predicts the outcome as if the unit had taken the control treatment. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE and PEHE metrics are calculated. In addition, an average score and its deviation for each run of the 10, 100 and 1000 replications of the IHDP have been computed, to evaluate these errors in bigger simulated scenarios.
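As a minimal sketch of this train-then-flip-the-treatment procedure (using a made-up synthetic toy dataset and scikit-learn's Ridge regressor purely for illustration, not the IHDP data or any specific method from the experiments), a single model can be fitted on covariates plus treatment and then queried with both treatment values:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))                 # covariates
t = rng.integers(0, 2, size=n)              # observed (non-randomised) treatment
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)  # factual outcome; true effect = 2.0

# Train one model on [covariates, treatment] -> factual outcome
model = Ridge().fit(np.column_stack([X, t]), y)

# Predict both potential outcomes for every unit, then derive the effects
y1_hat = model.predict(np.column_stack([X, np.ones(n)]))   # as if treated
y0_hat = model.predict(np.column_stack([X, np.zeros(n)]))  # as if control
ite_hat = y1_hat - y0_hat
ate_hat = ite_hat.mean()
```

On this toy data the recovered `ate_hat` lands close to the simulated effect of 2.0; on IHDP the same two-prediction pattern is applied per replication.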

The mathematical notation will be kept to a minimum, so as not to confuse the reader with unnecessary information.


1.4 Scope and Limitation

In this document, the outcomes of the treatment applied to a patient will only be analyzed with respect to two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments nor in the developed code, but both can easily be extended to cover these cases.

A binary treatment is applied, but its outcome value is continuous. This differs from the most commonly used case of four possible scenarios, in which usually just two can be observed or measured. All the experiments and the code developed can be applied to discrete outputs, but other machine learning techniques (classification algorithms) could be more suitable for that type of prediction. There are also cases in which the task is to predict whether or not to apply the treatment to an individual (also known as unit or patient). Prediction in these cases turns into a classification task, in which a threshold on the confidence for affirmatively applying the treatment is set and validated, through trial and error against several continuous values, to determine which one predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

This case is closer to real-world scenarios, where the data was observed and finally a decision on applying or not the treatment (action) has to be made in order to pursue a desired result.

In this work, the cases in which the dataset contains outcomes in binary form, for predicting whether or not to apply the treatment, will not be covered.


Chapter 2

Background

2.0.1 Rubin-Neyman Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Rubin-Neyman Potential Outcomes framework, is an extended statistical analysis frame for modeling observational data that Donald Rubin developed. He came up with the mentioned framework by building on top of the original method that Neyman developed in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Rubin-Neyman potential outcomes framework considers units

x_i ∈ X

with an effectively applied treatment

t_i ∈ {0, 1}

The two possible potential outcomes are defined by

Y_0(x_i), Y_1(x_i) ∈ Y

Of one of them (the one which actually happened) we can observe its factual outcome:

y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i)

Let (x_1, t_1, y_1^F), …, (x_n, t_n, y_n^F) be the sample from the factual distribution.

Consequently, let (x_1, 1 − t_1, y_1^CF), …, (x_n, 1 − t_n, y_n^CF) be the counterfactual sample.

Notice that all the factual outcomes y^F are known, whereas the counterfactual outcomes y^CF are never known for any unit (except during the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions y^F and y_ft will be used interchangeably to refer to factual (observed) outcomes, while y^CF and y_cft will denote counterfactual outcomes.
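The factual/counterfactual selection rule above can be written directly in code; the array values below are made up purely for illustration:

```python
import numpy as np

t  = np.array([1, 0, 1])            # applied treatments
y1 = np.array([2.0, 1.5, 3.0])      # potential outcomes under treatment
y0 = np.array([1.0, 1.0, 2.5])      # potential outcomes under control

y_f  = t * y1 + (1 - t) * y0        # factual outcomes: y_F = t*Y1 + (1-t)*Y0
y_cf = (1 - t) * y1 + t * y0        # counterfactual outcomes, unobservable in practice
```

In a real observational study only `y_f` would ever be recorded; `y_cf` is available here only because both potential outcomes were fabricated.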


2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x assigned either treatment t = 1 or t = 0, it is impossible to observe the counterfactual outcome E[Y_0 | x, t = 1] or E[Y_1 | x, t = 0] (what would have happened, or what the outcome would have been, had the other treatment been given to the unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y_0 | x, t = 0] or E[Y_1 | x, t = 1], or in shorter terms Y_0 or Y_1.

In this dissertation the focus is on the case where the causal graph is simple and known to be of the form (Y_1, Y_0) ← x → t, with no hidden confounders.

This problem can be extended, as can most of the problems and applications discussed under the Rubin-Neyman Potential Outcomes Framework in this dissertation, to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome y_cf is even worse in multi-treatment experiments, since the missing values that matter for better Individual Treatment Effect estimates grow with the total number of possible treatments except the one applied.

2.0.3 Metrics for Causality

Three well-known metrics from the causality field are reported for each machine learning technique applied.

The losses that will be reported are

• ε_ITE: Error of the Individual Treatment Effect, also known as the Conditional Average Treatment Effect (CATE); it measures how well or how badly the treatment performs on one particular patient.

ITE(x) := E[y | X = x, t = 1] − E[y | X = x, t = 0] = E[Y_x1 − Y_x0]

• ε_ATE: Error of the Average Treatment Effect; as its name describes, it represents the effect that the applied treatment (either t = 0 or t = 1) effectively had on the whole population. Note that, being an average, it may not be the best basis for treating a new patient, since their unique characteristics as a unit might make them experience wrong results or no results at all.

ATE := E[ITE(x)] = E[δ] = E[Y_1 − Y_0], ∀x ∈ X

• ε_PEHE: the Precision in Estimation of Heterogeneous Effect is used to measure the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or not as accurate, for the other.

PEHE := (1/N) ∑_{i=1}^{N} ((y_{i1} − y_{i0}) − (ŷ_{i1} − ŷ_{i0}))²

where ŷ denotes a predicted outcome.
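A minimal NumPy sketch of the ε_ATE and PEHE computations, assuming access to both true and predicted potential outcomes (only feasible on synthetic or semi-synthetic data such as IHDP); the function name and calling convention are my own:

```python
import numpy as np

def error_metrics(y1_true, y0_true, y1_pred, y0_pred):
    """Return (eps_ATE, PEHE) given true and predicted potential outcomes."""
    ite_true = y1_true - y0_true              # true individual treatment effects
    ite_pred = y1_pred - y0_pred              # estimated individual treatment effects
    eps_ate = abs(ite_true.mean() - ite_pred.mean())   # |ATE - ATE_hat|
    pehe = np.mean((ite_true - ite_pred) ** 2)         # mean squared ITE error
    return eps_ate, pehe
```

For instance, with true effects (2, 1) and predicted effects (1.5, 0.5) the sketch yields ε_ATE = 0.5 and PEHE = 0.25.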


2.0.4 Assumptions

To work on the results, three important assumptions under the Rubin-Neyman causal framework must be made:

• Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y_0 is the observed (factual) outcome y^F; if the applied treatment was t = 1, then y = Y_1 is the available observed (factual) outcome y^F.

• Strong Ignorability: also known as no unmeasured confounders, this assumption can be stated as (Y_1, Y_0) ⫫ t | x and 0 < p(t = 1 | x) < 1, ∀x. It is important to notice that, to be able to state this assumption, a domain knowledge expert has to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: this assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 | x) < 1
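The common support assumption can be probed empirically by estimating the propensity score p(t = 1 | x) and checking that it stays strictly inside (0, 1). A toy sketch with fabricated, deliberately biased data and arbitrary trimming bounds of (0.01, 0.99):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
# Treatment assignment depends on x[0] -> selection bias, as in observational data
t = (rng.random(500) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(int)

# Estimate the propensity score and flag units inside the trimmed support
propensity = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
inside = (propensity > 0.01) & (propensity < 0.99)
print(f"{inside.mean():.0%} of units lie inside the trimmed support")
```

Units with estimated propensity near 0 or 1 would violate common support and are often trimmed before effect estimation.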

2.0.5 Definitions

In causal inference from observational data, several terms are used interchangeably and might confuse the reader. These terms should be clear before going further into this dissertation.

Some common synonyms are:

• unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input. Notation: x_i, x_i ∈ X

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML). Notation: x, x ∈ X

• treatment: the possible actions that can be applied to a unit; usually binary, but possibly multi-valued under the Rubin-Neyman Potential Outcomes Framework. Synonym: action. Notation: t, t ∈ {0, 1} or t ∈ {0, …, N}

• outcome: the measured result of applying a treatment t to a unit x. Synonyms: observed outcome, result, factual, Y factual. Notation: y_f = y^F

• counterfactual: what the result would have been if the treatment opposite to the effectively applied one had been given to a unit. Synonym: unobserved outcome. Notation: y_cft, y_cf, Y^CF

2.0.6 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin, 1978).

Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning it has caught attention no more than a decade ago (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods perform causal inference by modeling the relationships between features, action pairs and rewards through one or more parameters, trying to specifically model the relations between context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015) and regression trees (Chipman, George and McCulloch, 2010) have been used in the past to complete the task. For example, (Wager and Athey, 2017) estimate ITEs with causal forests, but their asymptotic estimates on datasets with a large number of relevant features have limitations that need to be addressed in future work.

In non-parametric approaches, the counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causality is performed by merging parametric and non-parametric methods (Dudik, Langford and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both families. A common example of this approach is propensity score weighted regression (Bang and Robins, 2005; Dudik, Langford and Li, 2011). When the treatment assignment probability is known, this method models the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational data settings its efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions (along with the application of other techniques to causality), with special focus on datasets with unbalanced treatment application. This refers to the sub-area known as causal inference from observational data, i.e. data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting advances to the scientific community (Shalit, Johansson and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag, 2016a; Shalit, Johansson and Sontag, 2017; Alaa, Weisz and Van Der Schaar, 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that minimizes the risk of predicting the action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few: (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016), (Shalit, Johansson and Sontag, 2017). (Johansson, Shalit and Sontag, 2016a) and (Shalit, Johansson and Sontag, 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error loss between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag, 2017) the authors built on (Johansson, Shalit and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work implemented Gaussian processes (Alaa, Weisz and Van Der Schaar, 2017) and decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with several features, performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning features that are relevant for predicting some actions while not taking them into account for others. The relevant feature selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), likewise in (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and they do not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims, 2015a) came up with a Counterfactual Risk Minimization (CRM) method in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, closely related to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald and Pereira, 2006). Additional techniques in policy optimization were developed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work in the DA techniques field was done by (Zhang et al., 2013; Daumé, 2009).


To conclude, in cause and effect analysis Time Series data is widely adopted for decision making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time series causality task. Later, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that is set beforehand (Dudik, Langford and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas and Brunskill, 2017).

2.1 Machine Learning

In this section, the machine learning techniques applied to perform the experiments, using the scikit-learn open source framework, will be described.

The vast majority of the methods tested belong to the Generalized Linear Models family, and they can be represented as a target or label value expressed as a linear combination of the covariates (inputs):

ŷ(w, x) = w_0 + w_1 x_1 + … + w_p x_p    (2.1)

where the vector w = (w_1, …, w_p) represents the coefficients and w_0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between theobserved dataset and the predictions made on it

Mathematically, it solves the problem of:

min_w ||Xw − y||²₂

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the model produces a high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. The loss turns into a problem of minimizing a penalized sum of squares:

min_w ||Xw − y||²₂ + α||w||²₂

It is worth mentioning that the parameter α ≥ 0 controls the amount of robustness to collinearity that the trained model will have.

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely spread Support Vector Machines for classification, adapted to solve regression problems. During the training phase, the best possible solution is the one penalized least in total by a loss function. The support vectors are the inputs that are either misclassified, classified within a small enough margin, or lying on the edge of the generated hyperplane that splits the dataset for future predictions.

In particular, an SVR takes the training vectors x_i ∈ R^p, i = 1, …, n, and a vector y ∈ R^n; ε-SVR solves the following primal problem:

min_{w, b, ζ, ζ*}  (1/2) wᵀw + C ∑_{i=1}^{n} (ζ_i + ζ_i*)

where e is the vector of all ones, C > 0 is the upper bound, and Q is an n × n positive semidefinite matrix with Q_ij ≡ K(x_i, x_j) = φ(x_i)ᵀ φ(x_j) the kernel. Here, training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is:

∑_{i=1}^{n} (α_i − α_i*) K(x_i, x) + ρ

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than Linear Regression.

This technique builds a probabilistic model of the regression problem, with the parameter w of the general Bayesian Regression solver modeled as a spherical Gaussian:

p(w | λ) = N(w | 0, λ⁻¹ I_p)

The scikit-learn defaults are used to train the model: α_1 = α_2 = λ_1 = λ_2 = 10⁻⁶.

During the model fitting process, the parameters w, α and λ are estimated jointly.


2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ₁ prior as regularizer. Its objective is to minimize:

min_w  (1 / (2 n_samples)) ||Xw − y||²₂ + α||w||₁

This method minimizes the least-squares penalty with α||w||₁ added, where α is a constant and ||w||₁ is the ℓ₁-norm of the parameter vector.

It is important to notice that this algorithm yields sparse models, which may be helpful for performing feature selection.

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with L1 regularization applied.

The objective function is determined by:

(1 / (2 n_samples)) ||y − Xw||²₂ + α||w||₁

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w. It drops the assumption of the Gaussian being spherical, making it elliptical instead.

Mathematically:

p(w | λ) = N(w | 0, A⁻¹)

with diag(A) = λ = {λ_1, …, λ_p}

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, these methods do not require a learning rate, but they do require a regularization parameter C.

The method can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (squared epsilon-insensitive).

2.1.9 Theil Sen Regressor

It is especially suited to multivariate outliers, but its efficiency decreases dramatically when solving high-dimensionality problems. In high dimensions, this method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm, the target is predicted from the n nearest neighbors found during the training phase. It is important to notice that n is defined by the user, and it will affect the obtained prediction results positively or negatively.

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can be used for predictions over more than one class through the logistic function.

The scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets, and the results are discussed in the Experiments section.
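All of the methods above can be fitted with their scikit-learn defaults, mirroring the no-tuning setup this thesis describes; the toy regression dataset below stands in for IHDP purely for illustration, and the in-sample R² scores are printed only to show the common fit/score interface:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import (ARDRegression, BayesianRidge, Lasso,
                                  LassoLars, LinearRegression,
                                  PassiveAggressiveRegressor, Ridge,
                                  TheilSenRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Toy stand-in dataset; the thesis experiments use IHDP instead
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)

# Every regressor discussed above, with default hyper-parameters
models = [LinearRegression(), Ridge(), SVR(), BayesianRidge(), Lasso(),
          LassoLars(), ARDRegression(), PassiveAggressiveRegressor(),
          TheilSenRegressor(), KNeighborsRegressor()]
for m in models:
    r2 = m.fit(X, y).score(X, y)       # in-sample R^2, for illustration only
    print(f"{type(m).__name__:27s} R^2 = {r2:.3f}")
```

Because all estimators share the same `fit`/`predict` interface, swapping them in and out of the causal evaluation loop requires no per-model code.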


Chapter 3

Methodology

In this chapter, the goals of this work are detailed. In addition, the methods that were used are explained, as well as any other information needed to help the reader follow the experiments covered later.

Furthermore, the dataset that has been used is presented, closing with a section about other possible datasets that could be applied and the limitations of the ones used in this dissertation.

Finally, full coverage of the dataset used to perform the experiments is given.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of the recent efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients, or inputs (in machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Since such observational data was not properly randomized, the treated and untreated groups do not necessarily come from the same probability distribution. Also, the number of units that received the treatment versus the number that did not could differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they are unethical or impossible to perform. For example, an experiment testing whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, would be completely unethical to run for clear reasons.

To address these limitations when working on causal effects from observational data, synthetic, semi-synthetic, or toy datasets are created by researchers in order to establish a good starting point and a benchmark framework on which to try, test, or develop better algorithms that make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions in causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases a policy risk function π is designed that applies the treatment t depending on a certain threshold θ. Minimizing the errors made when deciding between the active treatment and the control is the main goal when iterating over different values of the threshold for the trained dataset.
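The thresholding idea can be sketched as follows; the function name and the numbers are hypothetical, a simplified illustration rather than the thesis code:

```python
import numpy as np

def policy_from_ite(ite_hat, theta=0.0):
    """Hypothetical policy pi: prescribe t = 1 when the predicted individual
    treatment effect exceeds the threshold theta, otherwise t = 0."""
    return (np.asarray(ite_hat) > theta).astype(int)

predicted_effects = np.array([0.5, -0.2, 1.3, 0.0])
policy = policy_from_ite(predicted_effects, theta=0.1)  # one decision per unit
```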

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment conditions to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and integration at a dedicated child development center in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

(Hill, 2011) presented a semi-synthetic dataset (also referred to, in this work and in the field, as semi-simulated) derived directly from the original IHDP RCT (Gross, 1993) mentioned in the paragraph above. In (Hill, 2011) some continuous and binary covariates from this real-life RCT were selected; using these covariates, the author created simulated, non-parametrically generated outcomes for the whole population of the trial. In total, 25 covariates of the whole study were taken for this dataset creation. The author then introduced an artificial imbalance between control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 and t = 1 based on the generalization that an algorithm can perform.

Along with the covariates, the simulated causal information can be observed for each unit: the effectively applied treatment (t = 0 or t = 1), the observed factual outcome (yF), the counterfactual outcome (yCF), and the average outcomes with noise, mu0 and mu1.
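A sketch of how one replication of such a container can be unpacked. The field names (x, t, yf, ycf, mu0, mu1) mirror the replication files distributed by (Johansson, 2017), but they are an assumption here and should be adapted to the actual file layout:

```python
import numpy as np

def load_replication(data, rep=0):
    """Pull one replication out of an IHDP-style container (assumed layout)."""
    x = data["x"][:, :, rep]              # 25 covariates per unit
    t = data["t"][:, rep]                 # applied treatment, 0 or 1
    yf = data["yf"][:, rep]               # factual (observed) outcome
    ycf = data["ycf"][:, rep]             # simulated counterfactual outcome
    mu0, mu1 = data["mu0"][:, rep], data["mu1"][:, rep]
    return x, t, yf, ycf, mu0, mu1

# Toy stand-in with the IHDP shapes: 747 units, 25 covariates, 1 replication.
fake = {"x": np.zeros((747, 25, 1)), "t": np.zeros((747, 1)),
        "yf": np.zeros((747, 1)), "ycf": np.zeros((747, 1)),
        "mu0": np.zeros((747, 1)), "mu1": np.zeros((747, 1))}
x, t, yf, ycf, mu0, mu1 = load_replication(fake)
```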

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting B in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which constitute the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles that evaluate ITE, ATE, and PEHE errors, for the reader to look into further if desired. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b) the IHDP dataset (Hill, 2011) was run with 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. These results are not shown in this dissertation since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). For the BART results, (Johansson, Shalit, and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

In a recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks), the authors referenced and ran their experiments on the IHDP dataset (Hill, 2011). However, they neither state the number of replications used to gather the metrics, nor whether a log-linear A or B (or any other) method was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with this dissertation's results and are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset was performed to ultimately predict the factual yF and counterfactual yCF outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE, and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should be aware by this point, neither the counterfactual outcome yCF nor the noisy average outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these three values, along with the factual outcome yF, are only used to compute the εITE, εATE, and εPEHE errors.

The experiments were run on 10, 100, and 1000 replications, both within-sample and out-of-sample. The replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same ones used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson, and Sontag, 2017), the same technique later followed by (Louizos et al., 2017) to perform, compare, and show their results.

Within-sample: this refers to all the errors (ITE, ATE, and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples under treatments t = 0 and t = 1), in which only one applied treatment and its factual outcome are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data and were mentioned in the previous chapters.


Out-of-sample: these predictions are made on completely unseen units, outside the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions even more different from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE, and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using treatment value t = 0, and subsequently predictions are made setting all values of the treatment to t = 1. The subtraction between these two predictions for each input is known as the ITE, and it ultimately defines whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by E[Y1 − Y0 | x].

The machine learning algorithms, implemented in Python with (User guide: contents — scikit-learn 0.19.2 documentation), were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. The hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
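The whole procedure can be sketched on toy data; here the true noiseless outcomes stand in for mu0/mu1, only factual outcomes are used for training, and all names and values are illustrative rather than the thesis code:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
n = 200
x = rng.randn(n, 5)
t = rng.binomial(1, 0.3, size=n)
mu0 = x[:, 0]             # toy noiseless outcome under control
mu1 = x[:, 0] + 2.0       # toy noiseless outcome under treatment (effect = 2)
yf = np.where(t == 1, mu1, mu0) + 0.1 * rng.randn(n)

# Train one regressor on [covariates, treatment]; only factual data is used.
model = SVR(kernel="rbf", C=1e3, gamma=0.01).fit(np.column_stack([x, t]), yf)

# Predict both potential outcomes for every unit and subtract.
y0_hat = model.predict(np.column_stack([x, np.zeros(n)]))
y1_hat = model.predict(np.column_stack([x, np.ones(n)]))
ite_hat = y1_hat - y0_hat                 # estimate of E[Y1 - Y0 | x]

ite_true = mu1 - mu0
eps_ate = np.abs(np.mean(ite_hat) - np.mean(ite_true))
eps_pehe = np.sqrt(np.mean((ite_hat - ite_true) ** 2))
```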

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset taken into account, within-sample and out-of-sample respectively. In this case it is remarkable that the split between training and testing was performed randomly, over the training dataset only. The intention was to test whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the already split training and test sets obtained from (Johansson, 2017 (accessed July 19, 2018)), which are the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, the hyperparameter tuning, if any, was performed on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting B generated using the code from (Dorie, 2016), was used for both types of measures.

Four different tables are shown for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] with the MinMaxScaler from the scikit-learn library. The results improved, not dramatically but enough to keep the scaling in the final results presented in the following section.
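The scaling step mentioned above can be sketched as follows (toy numbers; the thesis applied it to the IHDP covariates); note that the test set reuses the min/max learned on the training set:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 200.0], [3.0, 400.0], [2.0, 300.0]])
X_test = np.array([[2.0, 250.0]])

scaler = MinMaxScaler()                  # maps each training feature to [0, 1]
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the training min/max
```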



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is far below that of the regressors, the main reason being that when encoding the target values to assign them probabilities, these are not the same values that need to be predicted; in addition, precision is lost when decoding the predictions. The L2 norm was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

Method                               εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)  7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)      7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                               εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)  5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)      5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


From all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyperparameter tuning were performed. The errors observed were even smaller, so the final hyperparameters selected for this dataset were a Radial Basis Function (RBF) kernel, C=1e3, and gamma=0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state that they use for their own hyperparameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyperparameter selection, with the final choice highlighted by its errors.
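A hedged sketch of such a selection loop using a subset of the gamma values explored in the tables below (synthetic data here; the thesis ran the search over the IHDP replications):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.randn(150, 6)
y = np.sin(X[:, 0]) + X[:, 1] + 0.1 * rng.randn(150)

# Subset of the grid explored in Tables 4.13/4.14 (values illustrative).
grid = {"kernel": ["rbf"], "C": [1e3], "gamma": [0.1, 0.05, 0.01, 0.001]}
search = GridSearchCV(SVR(), grid, cv=3).fit(X, y)
best = search.best_params_  # e.g. the gamma with the best cross-validated score
```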

TABLE 4.13: IHDP 100 replications, SVR hyperparameter tuning - Within-sample

Configuration              εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1           3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05          2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01          2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001         3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001        4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001       4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2       2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1       2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4       2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2      2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyperparameter tuning - Out-of-sample

Configuration              εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1           2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05          2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01          2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001         3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001        4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001       4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2       2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1       2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4       2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2      3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39

Finally, the final results obtained by this thesis and its experiments are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method        √εPEHE       εATE
OLS/LR-1      5.8 ± .3     .73 ± .04
OLS/LR-2      2.4 ± .1     .14 ± .01
BLR           5.8 ± .3     .72 ± .04
k-NN          2.1 ± .1     .14 ± .01
TMLE          5.0 ± .2     .30 ± .01
BART          2.1 ± .1     .23 ± .01
RAND.FOR.     4.2 ± .2     .73 ± .05
CAUS.FOR.     3.8 ± .2     .18 ± .01
BNN           2.2 ± .1     .37 ± .03
TARNET        .88 ± .0     .26 ± .01
CFR MMD       .73 ± .0     .30 ± .01
CFR WASS      .71 ± .0     .25 ± .01

Within-sample, IHDP, 1000 replications.

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method        √εPEHE       εATE
OLS/LR-1      5.8 ± .3     .94 ± .06
OLS/LR-2      2.5 ± .1     .31 ± .02
BLR           5.8 ± .3     .93 ± .05
k-NN          4.1 ± .2     .79 ± .05
BART          2.3 ± .1     .34 ± .02
RAND.FOR.     6.6 ± .3     .96 ± .06
CAUS.FOR.     3.8 ± .2     .40 ± .03
BNN           2.1 ± .1     .42 ± .03
TARNET        .95 ± .0     .28 ± .01
CFR MMD       .78 ± .0     .31 ± .01
CFR WASS      .76 ± .0     .27 ± .01

Out-of-sample, IHDP, 1000 replications.


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though powerful feature selection methods were presented in the Related Work section, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) via the scikit-learn framework.

Strictly speaking, under the Strong Ignorability assumption made for the studied dataset it would not be appropriate to perform such an experiment; but given the sensitivity of the machine learning regressors to highly correlated input features, removing some of them might relieve part of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; thus they are not shown, but they can be reviewed by the reader in the code implementation for further analysis.
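For reference, a minimal sketch of the RFE mechanism (synthetic regression data, illustrative feature counts): the estimator is refit repeatedly, dropping the weakest-ranked features until the requested number remains:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=3, random_state=0)

# Recursively eliminates the weakest features until three remain.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
kept = rfe.support_  # boolean mask of retained features
```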

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for the 10 replications in this work; the uploaded code contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated finishing time was about four and a half days with an Intel dual-core i7.

Domain Adaptation algorithms are a promising family with which to explore ITE and ATE predictions, due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                  εITE           εATE           √εPEHE
DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and out-of-sample, IHDP, 10 replications.

4.3 Discussion

As can be clearly noticed, the results of the machine learning regressor algorithms applied in this dissertation are very close to those published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treated dataset, nor any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.


It seems that an excessive amount of effort by the authors leads to complicated methods without gaining much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or of the treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the 10-replication training and testing errors of the Domain Adaptation Neural Networks showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate, custom techniques that constitute the state-of-the-art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is part of the must-do tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics of the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, although it evolved once the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years are potentially more suitable than both plain regressors and custom or generalized metric and error functions, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature. Moreover, continuous-space causal problems on observational data with more than two possible treatments to apply are substantially better suited to Reinforcement Learning algorithms than to any other deep neural network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers have been published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include several approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to multi-valued treatments. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold deciding whether a treatment should be applied, is an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it showed in the experiment run.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied over time frames them within time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers in machine learning applied to treatments applied over time.


Bibliography

Alaa, Ahmed M, Michael Weisz and Mihaela Van Der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: arXiv:1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf


Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer, Patrik O et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww


Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos, Christos et al (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: arXiv:1705.08821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1


Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton, Richard S and Andrew G Barto (2017). Reinforcement Learning: An Introduction (Complete Draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

ndash (2015b) The Self-Normalized Estimator for Counterfactual Learning

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: arXiv:1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep


Chapter 1

Introduction

11 Motivation

Causality is often confused with correlation, but correlation does not imply causation. Such inferences are often called spurious correlations, and they frequently confuse the inference process by which humans make decisions.

The scientific community has still not agreed on a common definition of causality.

The proven scientific way to make claims about cause and effect is to perform what is called a Randomized Controlled Trial (RCT). In a Randomized Controlled Trial, a statistically representative portion of the population participating in the experiment (trial) is exposed to a treatment (action), which can be either positive - applying the treatment - or neutral (control) - giving the patient (unit) a placebo or not treating them at all.

All these concepts use medical vocabulary, since the field in which RCTs are applied the most is medical trials. However, medicine is not the only industry in which conclusions can be drawn from a trial. RCTs are widely used in social studies, for example, and can also be applied to decisions on buying, selling, or holding a particular stock, or, in the advertising industry, to choosing the advertisement that generates more sales than the others.

Nevertheless, although Randomized Controlled Trials are the best way to detect causal effects, performing such scientifically rigorous experiments is most of the time either impossible or unethical. An example can be seen when trying to detect whether driving under the effects of alcohol affects the driver's skills. Another clear example is determining the causes of smoking in teenagers or young people, for which an RCT would involve taking two groups of non-smoking teenagers, making half of the units smoke for several years - depending on the experiment or the research question - and then determining the effects of smoking in young people after that period. As the reader can infer, there are clear ethical problems associated with performing full RCTs to determine causes and effects.

It is also important to make clear that there are cases in which previously collected data must be analyzed even though the RCT method cannot be assumed. Such cases are known as observational data. Observational data is defined as information obtained from previously collected situations in which a formal randomized trial method was not applied, but from which it is still important to try to determine causes and effects. This is the case for most organizations at present: they may have been collecting massive amounts of data during the last decades, but they could not, or would not, establish an RCT process while collecting the metrics. Moreover, sometimes the data collection happens in a non-randomized, uncontrolled setting. For example, a less common disease that affects just a small percentage of the population might happen to appear within a wide range of people, which makes the inference process difficult.

In the past, causal inference methods have been a statisticians-only field. However, with recent advances in machine learning algorithms, more computer scientists and machine learning engineers have been trying to infer causes and relationships through traditional and new machine learning techniques.

The ability of some machine learning algorithms to learn complex non-linear relationships has been used to detect and predict policies in which, given the particular features of an individual (patient), the algorithm determines whether or not to apply the treatment (action) to them. This concept is known as Individual Treatment Effect estimation; it may also be referred to as Policy Risk when predicting a binary class - whether or not to apply the treatment - through the discovery of a certain threshold.

Being able to predict the individual (customized) treatment effect is of interest because it would lead to better decisions (actions or treatments) specifically shaped for each person, rather than relying only on the average over the whole studied population.

The ultimate motivation of this work is to predict Individual Treatment Effects for new patients from previously collected data, using machine learning techniques alternative to the ones used in past research efforts. Moreover, this work attempts to compile a literature review that can be understood by more computer scientists, and running code for all the experiments performed is released for others to build upon.

12 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE) on a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program): a semi-synthetic dataset, particularly unbalanced, created for the task of causal inference on observational data (Hill, 2011).

The research question is whether alternative machine learning methods - without adding extra complexity, custom error losses, or custom metric functions while learning and predicting - are able to obtain similar or even better results than the state of the art on the exact same benchmark dataset.


13 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes Framework (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation will be used throughout this whole thesis. This model is also known as the Neyman-Rubin causal model, and it is an approach to the statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn (User guide: contents — scikit-learn 0.19.2 documentation) Python framework has been used, with all its underlying methods and default hyper-parameters. The mathematical notation of its documentation is also used to describe each algorithm's functions and limitations.
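As an illustrative sketch (the dictionary keys and the toy data are mine; the estimator list follows the regressors discussed in the Background chapter), the experiments instantiate the scikit-learn estimators with their default hyper-parameters only:

```python
# A minimal sketch, assuming the regressors listed in the Background chapter;
# every model keeps its scikit-learn defaults (no custom losses or tuning).
import numpy as np
from sklearn.linear_model import (LinearRegression, Ridge, BayesianRidge, Lasso,
                                  LassoLars, ARDRegression,
                                  PassiveAggressiveRegressor, TheilSenRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

regressors = {
    "ols": LinearRegression(), "ridge": Ridge(), "svr": SVR(),
    "bayesian_ridge": BayesianRidge(), "lasso": Lasso(),
    "lasso_lars": LassoLars(), "ard": ARDRegression(),
    "passive_aggressive": PassiveAggressiveRegressor(),
    "theil_sen": TheilSenRegressor(), "knn": KNeighborsRegressor(),
}

# Toy data standing in for the IHDP covariates and factual outcome.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 5))
y = X[:, 0] + 0.1 * rng.normal(size=60)

# Fit each model on the factual data and collect in-sample predictions.
predictions = {name: m.fit(X, y).predict(X) for name, m in regressors.items()}
```

Each model is then evaluated with the causal metrics described in the Background chapter rather than with the usual regression scores.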

The different algorithms have been tested on a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill, 2011) as a semi-synthetic dataset based on real features obtained from an observational study (Gross, 1993). Replications of this dataset (10, 100 and 1000) have been created so that the machine learning models can be trained and evaluated on them to obtain the desired error metrics.

It is important to note that the testing method and the metrics used to determine the effectiveness of each algorithm differ from the ones normally used to test machine learning algorithms; testing these algorithms within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset has been used, the real Individual Treatment Effect is available for computing test metrics. Therefore, the experiment results consist of the performance of each algorithm in terms of the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE). These three metrics are displayed in the Experiments chapter for each algorithm trained; a detailed explanation of their formulas can be found in the Metrics for Causality section of the Background chapter.

Also note that the machine learning algorithms are trained using just the applied (observed) treatment, the features (covariates in the causal inference literature) and the observed outcome (usually known as Y or Y factual). After training, a dataset completely unseen during training is used for testing purposes.

The already trained algorithm predicts the outcome based on the unit's (patient's) features (covariates), both for the case in which the unit would have taken the treatment and, likewise, for the case in which the unit would have taken the control treatment. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE and PEHE metrics are calculated. In addition, an average score and its deviation are computed over the 10, 100 and 1000 replications of IHDP to evaluate these errors in larger simulated scenarios.
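The procedure above can be sketched as follows. This is a hedged illustration rather than the exact experiment code: the helper name and the choice of Ridge as the example regressor are mine.

```python
# Minimal sketch of the protocol described above: fit one regressor on
# (covariates, observed treatment) -> factual outcome, then query it twice
# per test unit (t = 1 and t = 0) to estimate both potential outcomes.
import numpy as np
from sklearn.linear_model import Ridge

def fit_and_predict_both_arms(X_tr, t_tr, y_tr, X_te, model=None):
    model = Ridge() if model is None else model
    model.fit(np.column_stack([X_tr, t_tr]), y_tr)
    t1 = np.ones(len(X_te))
    y1_hat = model.predict(np.column_stack([X_te, t1]))       # outcome if treated
    y0_hat = model.predict(np.column_stack([X_te, 1 - t1]))   # outcome if control
    return y0_hat, y1_hat

# Toy data with a known treatment effect of 2.0 for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
t = rng.binomial(1, 0.5, size=200)
y = X[:, 0] + 2.0 * t + 0.1 * rng.normal(size=200)

y0_hat, y1_hat = fit_and_predict_both_arms(X, t, y, X)
ite_hat = y1_hat - y0_hat   # estimated individual treatment effects
```

The per-unit differences `ite_hat` are then compared against the true effects, which are known only because the benchmark is semi-synthetic.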

Mathematical notation will be kept to a minimum so as not to confuse the reader with unnecessary information.


14 Scope and Limitation

In this document, the outcomes of a treatment applied to a patient are analyzed with respect to only two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments or in the developed code, but both can easily be extended to cover these cases.

A binary treatment is applied, but its outcome value is continuous; this differs from the most commonly used case of four possible scenarios in which usually just two can be observed or measured. All the experiments and the code developed can be applied to discrete outputs, although other machine learning techniques (classification algorithms) could be more suitable for this type of prediction. There are also cases in which the task is to predict whether or not to apply the treatment to an individual (also known as a unit or patient). Prediction then turns into a classification task, in which a threshold on the confidence of affirmatively applying the treatment is set and validated through trial and error against several continuous values to determine which one predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

This case is closer to real-world scenarios, in which the data was observed and a final decision on whether or not to apply the treatment (action) has to be made in order to pursue a desired result.

This work does not cover the case in which the dataset contains outcomes in binary form for predicting whether or not to apply the treatment.


Chapter 2

Background

201 Rubin-Neyman Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Rubin-Neyman Potential Outcomes framework, is an extended statistical analysis framework for modeling observational data developed by Donald Rubin. Rubin built the framework on top of the original method that Neyman introduced in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Rubin-Neyman potential outcomes framework considers units

x_i ∈ X

with an effectively applied treatment

t_i ∈ {0, 1}.

The two possible potential outcomes are defined by

Y_0(x_i), Y_1(x_i) ∈ Y.

Of one of them (the one which actually happened) we can observe its factual outcome

y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i).

Let (x_1, t_1, y_1^F), ..., (x_n, t_n, y_n^F) be a sample from the factual distribution, and consequently let (x_1, 1 − t_1, y_1^CF), ..., (x_n, 1 − t_n, y_n^CF) be the counterfactual sample.

Notice that all the factual outcomes y^F are known, whereas the counterfactual outcome y^CF is never observed for any unit (except during the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions y^F or y_ft, referring to factual observed outcomes, and y^CF or y_cft, pointing to counterfactual outcomes, will be used interchangeably.
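The factual/counterfactual split can be illustrated numerically (the numbers below are made up for the example): the factual outcome is Y_1 where t = 1 and Y_0 where t = 0, and the remaining potential outcome is the unobserved counterfactual.

```python
import numpy as np

t  = np.array([1, 0, 1, 0])               # applied treatments
y0 = np.array([1.0, 2.0, 3.0, 4.0])       # potential outcomes under control
y1 = np.array([1.5, 2.5, 3.5, 4.5])       # potential outcomes under treatment

y_f  = t * y1 + (1 - t) * y0              # factual:        [1.5, 2.0, 3.5, 4.0]
y_cf = (1 - t) * y1 + t * y0              # counterfactual: [1.0, 2.5, 3.0, 4.5]
```

In a real observational dataset only `t` and `y_f` are available; the whole difficulty of the field is that `y_cf` is missing.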


202 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x assigned either treatment t = 1 or t = 0, it is impossible to observe the counterfactual outcome E[Y_0 | x, t = 1] or E[Y_1 | x, t = 0] (what would have happened, or what the outcome would have been, if the other treatment had been given to the unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y_0 | x, t = 0] or E[Y_1 | x, t = 1], or in shorter terms Y_0 or Y_1.

In this dissertation the focus is on the case in which the causal graph is simple and known to be of the form (Y_1, Y_0) ← x → t, with no hidden confounders.

This problem, like most of the problems and applications discussed under the Rubin-Neyman Potential Outcomes Framework in this dissertation, can be extended to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome y_cf becomes even worse in multi-treatment experiments, since the number of missing values that matter for better Individual Treatment Effect estimates grows with the number of possible treatments other than the one applied.

203 Metrics for Causality

Three well-known metrics from the causality field are reported for each machine learning technique applied.

The losses that will be reported are

• ε_ITE: Error of the Individual Treatment Effect - also known as the Conditional Average Treatment Effect (CATE) - measuring how well or how poorly the treatment performs on one particular patient.

ITE(x) := E[y | X = x, t = 1] − E[y | X = x, t = 0] = E[Y_x1 − Y_x0]

• ε_ATE: Error of the Average Treatment Effect. As its name describes, it represents the effect that the applied treatment (either t = 0 or t = 1) effectively had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, since their unique characteristics as a unit might make them experience wrong results or no results at all.

ATE := E[ITE(x)] = E[δ] = E[Y_1 − Y_0]

• ε_PEHE: Precision in Estimation of Heterogeneous Effect, used to measure the trade-off in precision between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or less accurate, for the other.

PEHE := (1/N) Σ_{i=1}^{N} ((y_i1 − y_i0) − (ŷ_i1 − ŷ_i0))²

where ŷ_i1 and ŷ_i0 denote the predicted potential outcomes.
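The three metrics can be sketched in code as below. The function name is mine, and ε_ITE is taken as the mean absolute error on the per-unit effects (one common convention); the computation is only possible when the true potential outcomes are known, as with the semi-synthetic IHDP data.

```python
# Hedged sketch of the three metrics above; usable only when the true
# potential outcomes (y0, y1) are known, e.g. on semi-synthetic data.
import numpy as np

def causal_metrics(y0, y1, y0_hat, y1_hat):
    ite_true = y1 - y0               # true per-unit effects
    ite_hat = y1_hat - y0_hat        # predicted per-unit effects
    eps_ite = np.mean(np.abs(ite_true - ite_hat))        # per-unit ITE error
    eps_ate = np.abs(ite_true.mean() - ite_hat.mean())   # ATE error
    eps_pehe = np.mean((ite_true - ite_hat) ** 2)        # PEHE as defined above
    return eps_ite, eps_ate, eps_pehe
```

With perfect predictions all three errors are zero; note that the square root of ε_PEHE is also commonly reported in the literature.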


204 Assumptions

To work on the results, three important assumptions under the Rubin-Neyman causal framework must be made:

• Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if t = 0 then y = Y_0 will be the observed, or factual, outcome (y^F); if the applied treatment was t = 1, then y = Y_1 will be the available observed outcome, or factual, y^F.

• Strong Ignorability: also known as no unmeasured confounders, this assumption can be stated as (Y_1, Y_0) ⫫ t | x and 0 < p(t = 1 | x) < 1 for all x. To be able to state this assumption, a domain knowledge expert has to assess the dataset and determine that there are no unmeasured confounders; that is the case for the dataset used in this work.

• Common Support: this assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 | x) < 1
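A simple empirical diagnostic for common support (an illustrative sketch of my own, not a formal test: the propensity model and the tolerance `eps` are assumptions) is to estimate the propensity score p(t = 1 | x) and check that the estimates stay strictly inside (0, 1):

```python
# Illustrative overlap check: fit a propensity model p(t=1|x) and verify the
# estimated scores lie strictly inside (eps, 1 - eps) for every unit.
import numpy as np
from sklearn.linear_model import LogisticRegression

def overlap_diagnostic(X, t, eps=0.01):
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    holds = bool(np.all((ps > eps) & (ps < 1 - eps)))
    return ps, holds

# Toy data: treatment assigned independently of X, so overlap clearly holds.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
t = rng.binomial(1, 0.5, size=200)
ps, holds = overlap_diagnostic(X, t)
```

Units with estimated propensity at the extremes would indicate regions of the covariate space where one of the two treatment arms is essentially never observed.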

205 Definitions

In causal inference from observational data several terms are used interchangeablyand might confuse the reader

These terms should be clear before going further into this dissertation.

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input. Notation: x_i ∈ X.

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML). Notation: x ∈ X.

• treatment: the possible actions that can be applied to a unit. Usually binary, but can be multi-valued under the Rubin-Neyman Potential Outcomes Framework. Synonym: action. Notation: t ∈ {0, 1} or t ∈ {0, ..., N}.

• outcome: the measured result of applying a treatment t to a unit x. Synonyms: observed outcome, result, factual, Y factual. Notation: y_f = y^F.

• counterfactual: what the result would have been if the treatment opposite to the one effectively applied had been given to the unit. Synonym: unobserved outcome. Notation: y_cft, y_cf, Y^CF.

206 Related Work

Potential outcomes are the framework to mathematically describe causality and coun-terfactuals (Rubin 1978)


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interested study in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning the attention was caught not less than a decade ago (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods model the relationships among features, action pairs and rewards by means of one or more parameters, trying to specifically model the relations among context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015) and regression trees (Chipman, George and McCulloch, 2010) have been used in the past. For example, (Wager and Athey, 2017) estimate ITEs with causal forests, but their asymptotic estimates in datasets with a large number of relevant features have limitations that need to be addressed in future work.

In non-parametric approaches, counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causal inference is performed by merging parametric and non-parametric methods (Dudik, Langford and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both families. A common example of this approach is propensity score weighted regression (Bang and Robins, 2005; Dudik, Langford and Li, 2011). When the treatment assignment probability is known, this method models the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational data settings its efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions, as well as the application of other techniques to causality, with special focus on datasets with unbalanced treatment application. This refers to the sub-area of causality known as causal inference from observational data. Observational data is data that has been, or is, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance-learning metrics and custom loss functions applied to Neural Networks has brought interesting


advances to the scientific community (Shalit, Johansson, and Sontag 2017, "Learning Representations for Counterfactual Inference"). (Tian et al. 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for estimation of the Individual Treatment Effect, (Johansson, Shalit, and Sontag 2016a; Shalit, Johansson, and Sontag 2017; Alaa, Weisz, and Van Der Schaar 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims 2015a; Swaminathan and Joachims 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few, (Wager and Athey 2015), (Athey and Imbens 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag 2016), and (Shalit, Johansson, and Sontag 2017). (Johansson, Shalit, and Sontag 2016a) and (Shalit, Johansson, and Sontag 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson, and Sontag 2017) the authors built on (Johansson, Shalit, and Sontag 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work was performed with Gaussian processes (Alaa, Weisz, and Van Der Schaar 2017) and with decision trees in different approaches (Hill 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey 2015).

Similarly, (Atan et al. 2016) face the problem of learning from biased data with several features by performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning features that are relevant for predicting some actions while discarding them for others. The relevant-feature-selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), similar to (Tekin and Van Der Schaar 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and they do not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims 2015a) came up with a Counterfactual Risk Minimization (CRM) method that seeks to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame, and Van Der Schaar 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al. 2007; Blitzer, McDonald, and Pereira 2006). Additional policy optimization techniques were developed by (Beygelzimer and Langford 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame, and Van Der Schaar 2018) introduce is based on domain adaptation (DA), as in (Gan et al. 2016). More work in the DA field was done by (Zhang et al. 2013; Daumé 2009).


To conclude, in cause-and-effect analysis, time series data is widely adopted for decision-making support. The main challenge in the continuous-time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al. 2012), algorithms can learn rules to make decisions along time-steps. An estimator on structural nested models was introduced by (Lok 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time series causality task. Later, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used in off-policy learning to estimate the best expected reward of a policy that is fixed beforehand (Dudik, Langford, and Li 2011; Swaminathan and Joachims 2015a; Jiang and Li 2015; Paduraru et al. 2012; Doroudi, Thomas, and Brunskill 2017).

2.1 Machine Learning

This section describes the machine learning techniques, implemented with the scikit-learn open-source framework, that were applied in the experiments.

The vast majority of the tested methods are Generalized Linear Models, in which the target (label) value is expressed as a linear combination of the covariates (inputs):

$\hat{y}(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p$ (2.1)

where the vector $w = (w_1, \dots, w_p)$ represents the coefficients and $w_0$ is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it.

Mathematically, it solves a problem of the form:

$$\min_{w} \; \lVert Xw - y \rVert_2^2$$

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the coefficient estimates have high variance, and the model is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.
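A minimal usage sketch with scikit-learn (toy data, illustrative only, not the IHDP covariates):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from an exact linear relation: y = 1 + 2*x1 + 3*x2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# LinearRegression minimizes the residual sum of squares ||Xw - y||_2^2
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # recovers w = (2, 3) and w0 = 1
```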

21 Machine Learning 11

2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the Linear Regression method above by penalizing the size of the coefficients. The loss becomes a penalized sum-of-squares minimization problem:

$$\min_{w} \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2$$

It is worth mentioning that the parameter $\alpha \ge 0$ controls the amount of shrinkage, and hence how robust to collinearity the trained model will be.
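A small sketch showing the shrinkage effect of $\alpha$ on nearly collinear toy features (illustrative data):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x = rng.randn(50, 1)
X = np.hstack([x, x + 1e-6 * rng.randn(50, 1)])  # nearly collinear features
y = X[:, 0] + X[:, 1]

# Larger alpha shrinks the coefficients more, trading bias for robustness
w_small = Ridge(alpha=1e-8).fit(X, y).coef_
w_large = Ridge(alpha=10.0).fit(X, y).coef_
print(np.abs(w_small).sum(), np.abs(w_large).sum())  # the second sum is smaller
```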

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machine classifier to regression problems. During the training phase, the best possible solution is the one penalized least in total by the loss function. The support vectors are the inputs that are either predicted with an error beyond the tolerated margin or lie exactly on its edge, and they determine future predictions.

In particular, an SVR takes training vectors $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, and a target vector $y \in \mathbb{R}^n$. $\varepsilon$-SVR solves the following primal problem:

$$\min_{w, b, \zeta, \zeta^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)$$

subject to

$$y_i - w^T \phi(x_i) - b \le \varepsilon + \zeta_i, \qquad w^T \phi(x_i) + b - y_i \le \varepsilon + \zeta_i^*, \qquad \zeta_i, \zeta_i^* \ge 0,$$

where $C > 0$ is the regularization upper bound and $\zeta_i, \zeta_i^*$ are slack variables. In the dual formulation, $Q$ is an $n \times n$ positive semidefinite matrix with $Q_{ij} \equiv K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ the kernel. Here training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function $\phi$.

The decision function is:

$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + \rho$$
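A minimal sketch of $\varepsilon$-SVR on toy one-dimensional data (the C, gamma, and epsilon values here are illustrative, not the ones tuned later for IHDP):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 5, (80, 1)), axis=0)
y = np.sin(X).ravel()

# epsilon-insensitive loss: errors below epsilon carry no penalty;
# C upper-bounds the dual coefficients alpha_i - alpha_i^*
svr = SVR(kernel="rbf", C=1e3, gamma=0.5, epsilon=0.05).fit(X, y)
print(len(svr.support_), np.abs(svr.predict(X) - y).max())
```

Only the support vectors (indices in `svr.support_`) enter the decision function; the remaining points lie strictly inside the epsilon tube.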

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than plain Linear Regression.

This technique builds a probabilistic model of the regression problem, with the prior for the parameter $w$ of the general Bayesian Regression solver given by a spherical Gaussian:

$$p(w \mid \lambda) = \mathcal{N}(w \mid 0, \lambda^{-1} I_p)$$

The scikit-learn defaults are used to train the model: $\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 = 10^{-6}$.

During the model fitting process, the parameters $w$, $\alpha$, and $\lambda$ are estimated jointly.
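A small sketch with toy data (the generating coefficients are illustrative):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, 0.5, -2.0]) + 0.1 * rng.randn(100)

# Defaults alpha_1 = alpha_2 = lambda_1 = lambda_2 = 1e-6, as stated above
br = BayesianRidge().fit(X, y)
print(br.coef_)    # close to (1.0, 0.5, -2.0)
print(br.alpha_)   # estimated noise precision
print(br.lambda_)  # estimated weight precision
```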


2.1.5 Lasso

Lasso Regression is a linear model fitted with an $\ell_1$ prior as regularizer. Its objective is to minimize:

$$\min_{w} \; \frac{1}{2 n_{\text{samples}}} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1$$

This method solves the minimization of the least-squares penalty with $\alpha \lVert w \rVert_1$ added, where $\alpha$ is a constant and $\lVert w \rVert_1$ is the $\ell_1$-norm of the parameter vector.

It is important to notice that this algorithm yields sparse models, which can be helpful for feature selection.
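A small sketch of this sparsity on toy data with only two truly relevant features (illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(1)
X = rng.randn(100, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.randn(100)  # only features 0 and 3 matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.nonzero(lasso.coef_)[0])  # sparse: irrelevant coefficients are driven to exactly 0
```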

2.1.6 Lasso Lars

This model is a Lasso fitted with the Least Angle Regression (LARS) algorithm, applying $\ell_1$ regularization.

The objective function is:

$$\frac{1}{2 n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights $w$.

It drops the assumption that the Gaussian is spherical, allowing it to be elliptical.

Mathematically,

$$p(w \mid \lambda) = \mathcal{N}(w \mid 0, A^{-1})$$

with $\text{diag}(A) = \lambda = \{\lambda_1, \dots, \lambda_p\}$.

2.1.8 Passive Aggressive Regressor

Passive Aggressive algorithms are suitable for large-scale learning: they do not require a learning rate, but they do require a regularization parameter $C$.

They can be used with two different loss functions: PA-I, or epsilon insensitive, and PA-II, also known as squared epsilon insensitive.
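A minimal sketch (toy data; the C, loss, and iteration settings are illustrative):

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = X @ np.array([2.0, -1.0])

# loss="epsilon_insensitive" corresponds to PA-I;
# loss="squared_epsilon_insensitive" corresponds to PA-II
pa = PassiveAggressiveRegressor(C=1.0, loss="epsilon_insensitive",
                                max_iter=1000, tol=1e-3, random_state=0)
pa.fit(X, y)
print(pa.coef_)  # near (2, -1)
```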

2.1.9 Theil Sen Regressor

It is especially suited to data with multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems, where the method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm, the target is predicted from the $n$ nearest neighbors found in the training set. Note that $n$ is defined by the user, and its choice will positively or negatively affect the quality of the predictions.
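A minimal sketch showing how the neighbor average forms the prediction (toy data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# The prediction is the mean target of the n_neighbors closest training points
knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
print(knn.predict([[1.6]]))  # neighbors are x=2 and x=1, so (2 + 1) / 2 = 1.5
```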

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can produce predictions over more than one class using the logistic function.

This scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets and the results willbe discussed in the Experiments section
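A minimal sketch of the two L2-penalized solver configurations on toy binary data (illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(120, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# l2 penalty with the two solvers used in the experiments
for solver in ("newton-cg", "lbfgs"):
    clf = LogisticRegression(penalty="l2", solver=solver).fit(X, y)
    print(solver, clf.score(X, y))
```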


Chapter 3

Methodology

This chapter details what this work set out to achieve. In addition, the methods that were used are explained, along with any other information needed to help the reader follow the experiments covered later.

It also presents the dataset that was used, closing with a section on other possible datasets and on the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in full detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real lifescenarios are difficult to obtain

On the one hand, the whole point of the current project, and to some extent of the latest machine learning efforts applied to causality, is to make mostly accurate predictions on a set of units, patients, or inputs (in the machine learning vocabulary) that has been collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Since the already collected, or observational, data has by nature not been randomized properly, it does not come from a single probability distribution either. Also, the number of units which received the treatment could potentially differ substantially from the number which did not.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, would be completely unethical for clear reasons.

To address these limitations when working on causal effects in observational data, synthetic, semi-synthetic, or toy datasets are created by researchers in order to establish a good starting point and a benchmark framework on which to try, test, or develop better algorithms that make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions in causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases a policy risk function $\pi$ is designed to apply, or not, the treatment $t$ depending on a certain threshold $\theta$. The main goal, when iterating over different values of the threshold for the trained dataset, is to make as few errors as possible when deciding between the active treatment and the control.
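A minimal sketch of such a threshold policy (all names are illustrative):

```python
import numpy as np

# Hedged sketch: a threshold policy that recommends the treatment when the
# predicted individual effect exceeds theta (function and variable names
# are illustrative, not the thesis's exact formulation)
def policy(pred_ite, theta=0.0):
    return (pred_ite > theta).astype(int)

pred_ite = np.array([-1.0, 0.5, 2.0])
print(policy(pred_ite))             # treat units with positive predicted effect
print(policy(pred_ite, theta=1.0))  # a stricter threshold treats fewer units
```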

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment conditions to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and integration in a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group received only the pediatric follow-up.

However, (Hill 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross 1993) mentioned in the paragraph above. In (Hill 2011), some continuous and binary covariates from this real-life RCT were selected. Making use of these covariates, (Hill 2011) created non-parametrically simulated outcomes for the whole population of the trial; 25 covariates of the whole study were taken for this dataset creation. The author then introduced an artificial imbalance between the control and treatment individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units, or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of treatment $t = 0$ or $t = 1$ through a generalization task.

Along with the covariates, each unit carries the simulated causal information: the treatment effectively applied ($t = 0$ or $t = 1$), the observed factual outcome ($y_F$), the counterfactual outcome ($y_{CF}$), and the average potential outcomes mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting B in the NPCI package (Dorie 2016). The 100 and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag 2017; Louizos et al. 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of new machine learning techniques applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles that evaluate ITE, ATE, and PEHE errors, for the reader to look into further if desired. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag 2016b), the IHDP dataset (Hill 2011) is run on 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie 2016), selecting the log-linear response surface implemented as setting "B" in that tool. Those results are not shown in this dissertation, since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al. 2017; Shalit, Johansson, and Sontag 2017). In (Johansson, Shalit, and Sontag 2016b), the BART results are based on Bayesian Additive Regression Trees (Chipman, George, and McCulloch 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also referenced and ran experiments on the IHDP dataset (Hill 2011). However, the authors do not make explicit the number of replications used to gather the metrics, nor whether log-linear surface A, B, or any other method was used to simulate the semi-synthetic dataset. Consequently, the results they obtained cannot be compared with the results of this dissertation, and they are not shown in this work.


Chapter 4

Experiments

A series of runs on replications of the IHDP dataset was performed to ultimately predict the factual $y_F$ and counterfactual $y_{CF}$ outcomes for every single unit. Subsequently, those values are fed into the code produced by (Louizos et al. 2017), in which the $\epsilon_{ITE}$, $\epsilon_{ATE}$, and $\epsilon_{PEHE}$ errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which amounts to identifying the best possible action, or treatment, for a given unit $x$ with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome $y_{CF}$ nor the average outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these values, together with the factual outcome $y_F$, are used only to compute the $\epsilon_{ITE}$, $\epsilon_{ATE}$, and $\epsilon_{PEHE}$ errors.
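The ATE and PEHE error computations can be sketched as follows (hedged: the exact implementation of (Louizos et al. 2017), in particular their εITE definition, may differ in details):

```python
import numpy as np

def causal_errors(mu0, mu1, pred_y0, pred_y1):
    """Sketch of the evaluation metrics; not the authors' exact code."""
    true_ite = mu1 - mu0
    pred_ite = pred_y1 - pred_y0
    eps_ate = np.abs(true_ite.mean() - pred_ite.mean())      # error on the ATE
    eps_pehe = np.sqrt(np.mean((true_ite - pred_ite) ** 2))  # PEHE
    return eps_ate, eps_pehe

# Perfect predictions give zero error on both metrics
mu0 = np.array([0.0, 1.0])
mu1 = np.array([4.0, 5.0])
print(causal_errors(mu0, mu1, mu0, mu1))  # (0.0, 0.0)
```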

The experiments were run on 10, 100, and 1000 replications, both within-sample and out-of-sample. The 10, 100, and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the same ones used to produce the results in (Shalit, Johansson, and Sontag 2017; Louizos et al. 2017), whose tables of state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions given by (Shalit, Johansson, and Sontag 2017) in their publication are used, this being the same methodology later followed by (Louizos et al. 2017) to perform, compare, and show their results.

Within-sample: this test refers to all the errors (ITE, ATE, and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples in which treatments $t = 0$ and $t = 1$ were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for each individual $x \in X$. The other problem to overcome is that, in practice, the population that received treatment $t = 1$ and the population that received $t = 0$ might come from completely different probability distributions. All of these are common problems of observational data and were mentioned previously.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for $t = 0$ and $t = 1$ are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE, and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value $t = 0$; predictions are then made again setting all treatment values to $t = 1$. The difference between these two predictions for each input is known as the ITE, and will ultimately define whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by $E[Y_1 - Y_0 \mid x]$.

The machine learning algorithms implemented in Python code by (User guide: contents — scikit-learn 0.19.2 documentation) were run with the default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. The hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
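The procedure above can be sketched with simulated stand-in data and a plain linear regressor (the covariate count of 25 mirrors IHDP, but the data, outcome model, and effect size of 4 are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n = 200
X = rng.randn(n, 25)                          # stand-in covariates
t = rng.binomial(1, 0.2, n)                   # unbalanced treatment assignment
yf = X[:, 0] + 4.0 * t + 0.1 * rng.randn(n)   # simulated factual outcome, effect = 4

# Train one regressor on [covariates, treatment] -> factual outcome
model = LinearRegression().fit(np.hstack([X, t[:, None]]), yf)

# Predict every unit under t = 0 and under t = 1; the difference estimates
# the ITE, E[Y1 - Y0 | x]
y0 = model.predict(np.hstack([X, np.zeros((n, 1))]))
y1 = model.predict(np.hstack([X, np.ones((n, 1))]))
ite = y1 - y0
print(ite.mean())  # averaging the ITEs estimates the ATE, close to 4 here
```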

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1 IHDP 10 replications with traditional machine learning algorithms - Within sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is worth remarking that the split between training and testing was performed randomly, over the training dataset only. The intention was to check whether the results


TABLE 4.2 IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained in the following experiments, with the datasets already split into train and test downloaded from (Johansson 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3 IHDP 100 replications - Within sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4 IHDP 100 replications - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al. 2017; Shalit, Johansson, and Sontag 2017). As mentioned in this thesis, the hyperparameter tuning, where performed, was done on this number of replications.

TABLE 4.5 IHDP 100 replications, already split dataset - Within sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6 IHDP 100 replications, already split dataset - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag 2017; Louizos et al. 2017). The same semi-synthetic IHDP dataset by (Hill 2011), with log-linear response setting A generated using the code from (Dorie 2016), was used to perform both types of measures.

Four different tables are shown for the 1000 replications, in pairs. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The improvement was not dramatic, but enough to keep the scaling for the final results of the methods presented in the following section.
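A minimal sketch of the scaling step (toy data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 50.0]])

# Each feature is mapped linearly onto the [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # [0. 0.] [1. 1.]
```

In practice the scaler should be fit on the training covariates only and then applied unchanged to the test covariates.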



TABLE 4.7 IHDP 1000 replications - No scaling - Within sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8 IHDP 1000 replications - No scaling - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9 IHDP 1000 replications - Scaled - Within sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10 IHDP 1000 replications - Scaled - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Next, Logistic Regression with a multinomial multi-class setting was applied. Its performance is far below the regressors, mainly because the continuous target values must be encoded into classes to assign them probabilities, which are not the values that actually need to be predicted; precision is also lost when decoding the predictions. The $\ell_2$ penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Tables 4.11 and 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

εITE   εATE   √εPEHE

LogisticRegression - L2 (newton-cg)   7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)       7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

εITE   εATE   √εPEHE

LogisticRegression - L2 (newton-cg)   5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)       5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


Across all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were done. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel with C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the compared authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyper-parameter selection, including the final configuration.
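The kind of grid explored in Tables 4.13 and 4.14 can be reproduced with scikit-learn's GridSearchCV. The synthetic data below is an illustrative stand-in for the IHDP covariates, and the grid is a subset of the values listed in the tables:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 25))                       # stand-in for the IHDP covariates
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=150)     # synthetic continuous outcome

# Subset of the hyper-parameter grid reported in Tables 4.13 / 4.14
param_grid = {
    "kernel": ["rbf"],
    "C": [1e2, 1e3],
    "gamma": [0.1, 0.05, 0.01, 0.001],
}
search = GridSearchCV(SVR(), param_grid, cv=3, scoring="neg_mean_squared_error")
search.fit(X, y)

# The thesis settles on kernel="rbf", C=1e3, gamma=0.01 for IHDP
best = search.best_params_
```

On the real dataset the selection was done on the within-sample and out-of-sample causal error metrics rather than cross-validated MSE; the grid search above only illustrates the mechanics of sweeping C and gamma.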

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

εITE   εATE   √εPEHE

SVR-rbf-1e3-g0.1       3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g0.05      2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g0.01      2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0.001     3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g0.0001    4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g0.00001   4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g0.1      3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g0.1      3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g0.1      3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2   2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1   2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4   2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2  2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

εITE   εATE   √εPEHE

SVR-rbf-1e3-g0.1       2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g0.05      2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g0.01      2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0.001     3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g0.0001    4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g0.00001   4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g0.1      2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g0.1      2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g0.1      2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2   2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1   2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4   2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2  3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39

Finally, the results obtained by the experiments run in this thesis are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results reported in (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

√εPEHE   εATE

OLS/LR-1   5.8 ± .3   .73 ± .04
OLS/LR-2   2.4 ± .1   .14 ± .01
BLR        5.8 ± .3   .72 ± .04
k-NN       2.1 ± .1   .14 ± .01
TMLE       5.0 ± .2   .30 ± .01
BART       2.1 ± .1   .23 ± .01
RANDFOR    4.2 ± .2   .73 ± .05
CAUSFOR    3.8 ± .2   .18 ± .01
BNN        2.2 ± .1   .37 ± .03
TARNET     .88 ± .0   .26 ± .01
CFR MMD    .73 ± .0   .30 ± .01
CFR WASS   .71 ± .0   .25 ± .01

Within sample IHDP 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

√εPEHE   εATE

OLS/LR-1   5.8 ± .3   .94 ± .06
OLS/LR-2   2.5 ± .1   .31 ± .02
BLR        5.8 ± .3   .93 ± .05
k-NN       4.1 ± .2   .79 ± .05
BART       2.3 ± .1   .34 ± .02
RANDFOR    6.6 ± .3   .96 ± .06
CAUSFOR    3.8 ± .2   .40 ± .03
BNN        2.1 ± .1   .42 ± .03
TARNET     .95 ± .0   .28 ± .01
CFR MMD    .78 ± .0   .31 ± .01
CFR WASS   .76 ± .0   .27 ± .01

Out-of-sample IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications on powerful feature selection methods were presented in the Related Work section, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) through the scikit-learn framework.

In addition, under the Strong Ignorability assumption made for the studied dataset, it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, feature elimination might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work, thus they are not shown here, but they can be reviewed by the reader in the code implementation for further analysis.
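A minimal sketch of such an RFE experiment follows; the base regressor, the number of features kept and the synthetic data are illustrative assumptions rather than the exact configuration used in the code:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 25))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)   # two highly correlated covariates
y = 2.0 * X[:, 0] - X[:, 2] + 0.1 * rng.normal(size=100)

# Recursive Feature Elimination: repeatedly drop the weakest feature
# (smallest coefficient) until only n_features_to_select remain
selector = RFE(LinearRegression(), n_features_to_select=10, step=1)
selector.fit(X, y)

X_reduced = X[:, selector.support_]   # reduced covariate matrix for the regressors
```

The reduced matrix would then be fed to the same regressors as before; as noted above, on IHDP this reduction hurt the causal error metrics rather than helping.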

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel dual-core i7.

Domain Adaptation algorithms are a promising direction for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

εITE   εATE   √εPEHE
DANN (Within-sample)    1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and Out-of-sample IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the machine learning regressor algorithms applied in this dissertation obtain results very close to the ones published in the works of the cited and compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the treatment imbalance of the dataset) nor any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.


The amount of effort spent by those authors appears excessive, leading to complicated methods without gaining much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the 10 replications of the Domain Adaptation Neural Network training and testing showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and customized techniques that constitute the state-of-the-art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, though its emphasis shifted once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal inference problems on observational data, with more than two possible outcomes, that are substantially more suitable to solve with Reinforcement Learning algorithms than with any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerable gap in straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow when relating terms from the causal inference field to researchers with a computer and data science background. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions for this work include five different approaches that should be taken.

First, it would be important to apply these machine learning methods to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, extending the functionality developed for a binary treatment and a continuous outcome to multi-valued treatments. To the best of my knowledge it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.
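One way such an extension could look: train on the covariates plus the observed treatment, then score every treatment arm. The number of arms, the synthetic data and the model choice below are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

K = 3                                     # number of treatment arms (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))             # covariates (synthetic)
t = rng.integers(0, K, size=300)          # observed multi-valued treatment
y = X[:, 0] + 0.5 * t + 0.1 * rng.normal(size=300)   # observed outcome

# Same recipe as the binary case: one model over covariates + treatment
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(np.column_stack([X, t]), y)

# Predict the potential outcome under every arm and pick the best per unit
potentials = np.column_stack(
    [model.predict(np.column_stack([X, np.full(len(X), k)])) for k in range(K)]
)
best_arm = potentials.argmax(axis=1)      # assumes a larger outcome is better
```

The binary pipeline is recovered with K=2; the only structural change is predicting K potential outcomes per unit instead of two.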

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold deciding whether a treatment should be applied or not, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the 1000 replications of the IHDP dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it showed in the experiment run.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied falls within continuous-space time series problems. Such datasets will possibly be the next focus of researchers working on machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela Van Der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. "Causal Reasoning from Longitudinal Data". DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf


Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie Vincent (2016) NPCI Non-parametrics for Causal Inference URL https githubcomvdorienpci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww


Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1


Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Complete Draft Tech rep URLhttpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep


2 Chapter 1 Introduction

randomized trial method has not been applied, but it is still important to try to determine causes and effects from that data. This is the case in most present-day organizations: they have possibly been collecting massive amounts of data over the last decades, but they could not, or were not able to, establish an RCT process while collecting the metrics. Moreover, sometimes the data collection happens in a non-randomized, uncontrolled experiment. For example, a rare disease that affects just a small percentage of the population might appear across a wide range of people, which makes the inference process difficult.

In the past, causal inference methods have been a statisticians-only field. However, with recent advances in machine learning algorithms, more computer scientists and machine learning engineers have been trying to infer causes and relationships through traditional and new machine learning techniques.

The ability of some machine learning algorithms to learn complex non-linear relationships has been used to detect and predict policies in which, given the particular features of an individual (patient), the algorithm determines whether or not to apply the treatment (action) to them. This concept is known as Individual Treatment Effect estimation; it can also be referred to as Policy Risk when predicting a binary class, that is, whether or not to apply the treatment, through the discovery of a certain threshold.

It is a matter of interest to be able to predict the individual (customized) treatment effect, because this would lead to better decisions (actions or treatments) specifically shaped for each person, not merely relying on the average over the whole studied population.

The ultimate motivation of this work is to be able to predict Individual Treatment Effects for new patients from previously collected data, using machine learning techniques alternative to the ones used in past research efforts. Moreover, this work attempts to compile a literature review that can be understood by more computer scientists, and the running code of all the performed experiments will be released for others to build upon.

1.2 Purpose and Research Question

The purpose of this dissertation is to predict the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE) on a widely adopted benchmark dataset in the field, usually referred to as IHDP (Infant Health and Development Program): a semi-synthetic dataset, particularly unbalanced, created for the task of causal inference on observational data (Hill, 2011).

The research experiments investigate whether alternative machine learning methods, without adding extra complexity, custom error losses or custom metric functions while learning and predicting, are able to obtain similar or even better results than the state-of-the-art metrics based on the exact same benchmark dataset.


1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes Framework (Rubin, Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation will be used throughout this thesis. This model is also known as the Neyman-Rubin causal model, and it is an approach to the statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn Python framework (User guide: contents — scikit-learn 0.19.2 documentation) has been used, with all its underlying methods and default hyper-parameters. The mathematical notation of its documentation will also be used to describe each algorithm's functions and limitations.

The different algorithms have been tested on a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill, 2011) as a semi-synthetic dataset based on real features obtained from a real observational study (Gross, 1993). Replications of this dataset (10, 100 and 1000) have been created to train the machine learning models, predict with them, and compute the desired error metrics afterwards.

It is important to notice that the testing method and the metrics used to determine the effectiveness of each algorithm differ from the ones normally used to test machine learning algorithms; testing these algorithms within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset has been used, the real Individual Treatment Effect is available for computing the test metrics. Therefore, the experiment results consist of the performance of each algorithm in terms of the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE) and the Precision in Estimation of Heterogeneous Effect (PEHE). These are the three metrics displayed in the Experiments chapter for each trained algorithm; a detailed explanation of their formulas can be found in Section 2.0.3.

Also, it is important to notice that the machine learning algorithms are trained using just the applied (observed) treatment, the features (covariates, in the causal inference literature) and the observed outcome (usually known as Y or Y factual). After training, a completely unseen dataset is used for testing purposes.

The trained algorithm predicts, based on the features (covariates) of a unit (also known as patient), the outcome as if the unit had taken the treatment, and likewise the outcome as if the unit had taken the control. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE and PEHE metrics are calculated. In addition, an average score and its deviation are computed over the 10, 100 and 1000 replications of IHDP to evaluate these errors in larger simulated scenarios.
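The evaluation loop just described can be sketched as follows; a synthetic response surface stands in for an IHDP replication (so the true ITE is known, as in the semi-synthetic benchmark), and Ridge is just one of the regressors considered:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 10))              # covariates (synthetic stand-in for IHDP)
t = rng.integers(0, 2, size=n)            # binary treatment assignment

# Synthetic potential outcomes: the true ITE is known, as in semi-synthetic IHDP
y0 = X[:, 0] + rng.normal(scale=0.1, size=n)
y1 = y0 + 1.0 + 0.5 * X[:, 1]
y_f = np.where(t == 1, y1, y0)            # only the factual outcome is observed

# Train on covariates + observed treatment, against the factual outcome only
model = Ridge().fit(np.column_stack([X, t]), y_f)

# Predict both potential outcomes for every unit
mu1 = model.predict(np.column_stack([X, np.ones(n)]))
mu0 = model.predict(np.column_stack([X, np.zeros(n)]))

# Error metrics: predicted vs true individual treatment effects
ite_hat = mu1 - mu0
ite_true = y1 - y0
eps_ate = abs(ite_hat.mean() - ite_true.mean())
eps_pehe = np.sqrt(np.mean((ite_hat - ite_true) ** 2))
```

Note how the additive linear model recovers the average effect well (small eps_ate) but, lacking a treatment-covariate interaction, predicts a constant ITE and therefore incurs a larger eps_pehe; this mirrors the gap between ATE and PEHE errors visible across the result tables.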

The mathematical notation will be kept as minimum as possible to not confuse thereader with unnecessary information


1.4 Scope and Limitation

In this document, the outcomes of the treatment applied to a patient will be analyzed with respect to only two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments nor in the developed code, but the code can easily be extended to cover these cases.

A binary treatment is applied but its outcome value is continuous; this corresponds to the commonly used setting of four possible outcome scenarios per unit, of which usually just two can be observed or measured. All the experiments and the code developed can be applied to discrete outputs, but other machine learning techniques (classification algorithms) could be more suitable for that type of prediction. Also, there are cases in which the task is to predict whether or not to apply the treatment to an individual (also known as unit or patient). Prediction then becomes a classification task, in which a threshold on the confidence of predicting to affirmatively apply the treatment is set and validated through trial and error against several continuous values, to determine which one predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

This case is closer to real-world scenarios, where the data was observed and, finally, a decision on whether or not to apply the treatment (action) has to be made in order to pursue a desired result.

In this work, the cases in which the dataset contains outcomes in binary form, to predict whether or not to apply the treatment, will not be covered.


Chapter 2

Background

2.0.1 Rubin-Neyman Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Rubin-Neyman Potential Outcomes framework, is an extended statistical analysis framework for modeling observational data that Donald Rubin developed. He built the mentioned framework on top of the original method that Jerzy Neyman developed in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Rubin-Neyman potential outcomes framework considers units with covariates

x_i ∈ X

with an effectively applied treatment

t_i ∈ {0, 1}

The two possible potential outcomes are defined by

Y_0(x_i), Y_1(x_i) ∈ Y

For one of them (the one which actually happened) we can observe the factual outcome:

y_i^F = t_i Y_1(x_i) + (1 − t_i) Y_0(x_i)

Let (x_1, t_1, y_1^F), ..., (x_n, t_n, y_n^F) be a sample from the factual distribution.

Consequently, let (x_1, 1 − t_1, y_1^CF), ..., (x_n, 1 − t_n, y_n^CF) be the counterfactual sample.

Notice that all the factual outcomes y^F are known, whereas the counterfactual outcomes y^CF are never known for any unit (except during the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions y^F and y_ft will be used interchangeably to refer to factual observed outcomes, while y^CF and y_cft will denote counterfactual outcomes.


2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x and assigning either treatment t = 1 or t = 0 to that unit, it is impossible to observe the counterfactual outcome E[Y_0 | x, t = 1] or E[Y_1 | x, t = 0] (what would have happened, or what the outcome would have been, if the other treatment had been given to the unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, represented as E[Y_0 | x, t = 0] or E[Y_1 | x, t = 1], or in shorter terms Y_0 or Y_1.

In this dissertation the focus is on the case where the causal graph is simple and known to be of the form (Y_1, Y_0) ← x → t, with no hidden confounders.

This problem, like most of the problems and applications discussed under the Rubin-Neyman Potential Outcomes framework in this dissertation, can be extended to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome y^CF becomes even worse when extending to multi-treatment experiments, since the number of missing values that matter for better Individual Treatment Effect estimates grows with the total number of possible treatments, except the one applied.

2.0.3 Metrics for Causality

Three well-known metrics from the causality field are reported for each machine learning technique applied.

The losses that will be reported are:

• εITE: error of the Individual Treatment Effect, also known as the Conditional Average Treatment Effect (CATE). It measures how well or how badly the treatment performs on one particular patient.

ITE(x) := E[y | X = x, t = 1] − E[y | X = x, t = 0] = E[Y_1(x) − Y_0(x)]

• εATE: error of the Average Treatment Effect. As its name suggests, it represents the effect that the applied treatment (either t = 0 or t = 1) effectively had on the whole population. Note that, being an average, it may not be the best basis for treating a new patient, whose unique characteristics as a unit might lead to wrong results or no results at all.

ATE := E[ITE(x)] = E[δ] = E[Y_1 − Y_0], ∀x ∈ X

• εPEHE: Precision in Estimation of Heterogeneous Effect, used to measure the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. Notice that this metric relates the ATE and ITE predictions, penalizing predictions that are accurate for one measure but wrong, or less accurate, for the other.

PEHE := (1/N) Σ_{i=1}^{N} ((ŷ_{i1} − ŷ_{i0}) − (y_{i1} − y_{i0}))²
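On synthetic data, where both potential outcomes are known, these quantities can be computed directly; a minimal sketch with made-up toy values:

```python
import numpy as np

# True and predicted potential outcomes for N = 4 units (toy values).
y1_true = np.array([3.0, 2.0, 4.0, 1.0])
y0_true = np.array([1.0, 1.0, 3.0, 0.0])
y1_hat  = np.array([2.5, 2.5, 4.0, 1.5])
y0_hat  = np.array([1.0, 1.5, 2.5, 0.0])

ite_true = y1_true - y0_true                 # true individual effects
ite_hat  = y1_hat - y0_hat                   # estimated individual effects

eps_ate  = abs(ite_hat.mean() - ite_true.mean())        # error on the ATE
eps_pehe = np.sqrt(np.mean((ite_hat - ite_true) ** 2))  # square root of PEHE
```

The square root of PEHE is what the result tables later report, following the compared publications.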


2.0.4 Assumptions

To work on the results, three important assumptions under the Rubin-Neyman causal framework shall be made:

• Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y_0 will be the observed (factual) outcome y^F; if instead the applied treatment was t = 1, then y = Y_1 will be the available observed (factual) outcome y^F.

• Strong Ignorability: also known as no unmeasured confounders, this assumption can be stated as (Y_1, Y_0) ⫫ t | x and 0 < p(t = 1 | x) < 1, ∀x. Notice that, to be able to state this assumption, a domain knowledge expert has to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: this assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 | x) < 1

2.0.5 Definitions

In causal inference from observational data, several terms are used interchangeably and might confuse the reader.

This subsection should be clear before going further into this dissertation

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input. x_i ∈ X

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML). x ∈ X

• treatment: the possible actions that can be applied to a unit, usually binary but possibly multi-valued under the Rubin-Neyman Potential Outcomes framework. Synonym: action. t ∈ {0, 1} or t ∈ {0, ..., N}

• outcome: the measured result of applying a treatment t to a unit x. Synonyms: observed outcome, result, factual, Y factual. y^f = y^F

• counterfactual: what the result would have been if the opposite of the effectively applied treatment had been applied to the unit. Synonym: unobserved outcome. y^cf, Y^CF

2.0.6 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin, 1978).


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of study in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning the attention was caught no less than a decade ago (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods model the relationships between features, action pairs and rewards through one or more parameters, trying to explicitly model the relations between context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015) and regression trees (Chipman, George and McCulloch, 2010) have been used in the past for this task. For example, (Wager and Athey, 2017) estimate ITEs by causal forests, but their asymptotic estimates have limitations in datasets with a large number of relevant features that need to be addressed in future work.

In non-parametric approaches, the counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causality is performed by merging parametric and non-parametric methods (Dudik, Langford and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both families. A common example of this approach is propensity-score-weighted regression (Bang and Robins, 2005; Dudik, Langford and Li, 2011). When the treatment assignment probability is known, these methods model the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational data settings their efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions, as well as the application of other techniques to causality, with special focus on datasets with unbalanced treatment assignment. This refers to the subarea of causality known as causal inference from observational data. Observational data is data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting


advances to the scientific community (Shalit, Johansson and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag, 2016a; Shalit, Johansson and Sontag, 2017; Alaa, Weisz and Van Der Schaar, 2017) made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few: (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016) and (Shalit, Johansson and Sontag, 2017). (Johansson, Shalit and Sontag, 2016a) and (Shalit, Johansson and Sontag, 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag, 2017) the authors built on (Johansson, Shalit and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work was performed using Gaussian processes (Alaa, Weisz and Van Der Schaar, 2017) and decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with several features, performing feature selection while predicting among multiple possible actions (outcomes); this is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning the features that are relevant for predicting some actions while not taking them into account for others. The relevant feature selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), likewise from (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and they do not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims, 2015a) came up with a Counterfactual Risk Minimization (CRM) method in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald and Pereira, 2006). Additional techniques in policy optimization were developed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work in the DA techniques field was done by (Zhang et al., 2013; Daumé, 2009).


To conclude, in cause-and-effect analysis, Time Series data is widely adopted for decision making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time series causality task. Later, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that is set beforehand (Dudik, Langford and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas and Brunskill, 2017).

2.1 Machine Learning

In this section, the machine learning techniques applied to perform the experiments, using the scikit-learn open source framework, will be described.

The vast majority of the methods tested belong to Generalized Linear Models, in which the target or label value is represented as a linear combination of the covariates (inputs):

y(w, x) = w_0 + w_1 x_1 + ... + w_p x_p    (2.1)

where the vector w = (w_1, ..., w_p) represents the coefficients and w_0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model, the objective is to minimize the residual sum of squares between the observed dataset and the predictions made on it.

Mathematically it solves the problem of

min_w ||Xw − y||_2^2

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the model produces a high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. The loss turns into a problem of minimizing a penalized sum of squares:

min_w ||Xw − y||_2^2 + α ||w||_2^2

It is worth mentioning that the parameter α ≥ 0 controls the amount of robustness to collinearity that the trained model will have.
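The effect of the penalty on collinear data can be illustrated with a small sketch (the nearly duplicated feature and α = 1.0 are arbitrary choices for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)     # nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Under near-collinearity the OLS coefficients blow up in opposite directions,
# while the ridge penalty keeps them small and stable.
ols_norm = np.abs(ols.coef_).sum()
ridge_norm = np.abs(ridge.coef_).sum()
```

Here `ridge_norm` stays close to the true coefficient scale, whereas `ols_norm` inflates, which is exactly the high-variance behavior described above.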

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification to regression problems. During the training phase, the best possible solution is the one that is least penalized in total by the loss function. The support vectors are the inputs that are either misclassified, classified within the margin, or on the edge of the generated hyperplane that splits the dataset for future predictions.

In particular, an SVR takes training vectors x_i ∈ R^p, i = 1, ..., n, and a vector y ∈ R^n; ε-SVR solves the following primal problem:

min_{w, b, ζ, ζ*}  (1/2) w^T w + C Σ_{i=1}^{n} (ζ_i + ζ*_i)

where e is the vector of all ones, C > 0 is the upper bound, and Q is an n × n positive semidefinite matrix with Q_ij ≡ K(x_i, x_j) = φ(x_i)^T φ(x_j) the kernel. Here, training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is:

Σ_{i=1}^{n} (α_i − α*_i) K(x_i, x) + ρ
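A short usage sketch with scikit-learn's SVR (the RBF kernel and the C and ε values here are illustrative defaults, not tuned settings):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# epsilon-SVR: C bounds the penalty, epsilon sets the width of the
# no-penalty tube around the regression function.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
pred_at_zero = svr.predict([[0.0]])[0]       # sin(0) = 0, so this should be near 0
```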

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than Linear Regression.

This technique elaborates a probabilistic model of the regression problem, with the parameter w of the general Bayesian Regression solver given a spherical Gaussian prior:

p(w | λ) = N(w | 0, λ^{−1} I_p)

The scikit-learn defaults are used to train the model: α_1 = α_2 = λ_1 = λ_2 = 10^{−6}.

During the fitting process, the parameters w, α and λ are estimated jointly.


2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

min_w  (1 / (2 n_samples)) ||Xw − y||_2^2 + α ||w||_1

This method minimizes the least-squares loss with α ||w||_1 added, where α is a constant and ||w||_1 is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm produces sparse models, which may be helpful for feature selection.
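The sparsity can be seen in a short sketch in which only one of ten features actually drives the outcome (the synthetic data and α = 0.5 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # only feature 0 matters

lasso = Lasso(alpha=0.5).fit(X, y)
n_nonzero = int(np.sum(lasso.coef_ != 0))    # irrelevant features are zeroed out
```

The ℓ1 penalty drives the nine irrelevant coefficients exactly to zero, leaving only the informative feature with a (shrunk) nonzero weight.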

2.1.6 Lasso Lars

This model is trained with Least Angle Regression (LARS), with L1 regularization applied.

The objective function is determined by

(1 / (2 n_samples)) ||y − Xw||_2^2 + α ||w||_1

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w.

It drops the assumption that the Gaussian is spherical, allowing it to be elliptical.

Mathematically,

p(w | λ) = N(w | 0, A^{−1})

with diag(A) = λ = {λ_1, ..., λ_p}.

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, these methods do not require a learning rate, but they do require a regularization parameter C.

It can be used with two different loss functions: PA-I (epsilon insensitive) or PA-II (squared epsilon insensitive).

2.1.9 Theil Sen Regressor

It is especially suited to multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems. In that regime this method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm, the target is predicted from the n nearest neighbors found during the training phase. It is important to notice that n is defined by the user and will positively or negatively affect the quality of the predictions.
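A minimal sketch showing how the choice of n changes the prediction:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

knn1 = KNeighborsRegressor(n_neighbors=1).fit(X, y)
knn3 = KNeighborsRegressor(n_neighbors=3).fit(X, y)

p1 = knn1.predict([[0.2]])[0]   # nearest point is x = 0, so p1 = 0.0
p3 = knn3.predict([[0.2]])[0]   # average of targets at x = 0, 1, 2, so p3 = 1.0
```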

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, although it can produce predictions for more than one class using the logistic (log) function.

The scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets, and the results are discussed in the Experiments section.


Chapter 3

Methodology

In this chapter, what was attempted will be detailed. In addition, the methods used will be explained, as well as any other information necessary to help the reader understand the flow of the experiments covered later.

In addition, the dataset that was used will be presented, closing with a section about other possible datasets and the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments will be covered in detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of the latest efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients or inputs (in machine learning vocabulary) that was collected without the chance of previously designing a carefully planned Randomized Controlled Trial. By its nature, the already collected (observational) data has not been properly randomized, nor does it come from the same probability distribution. Also, the number of units which received the treatment versus the number which did not could differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, tested against a control treatment (driving without alcohol consumption), would be completely unethical for clear reasons.

To overcome these limitations when working on causal effects in observational data, synthetic, semi-synthetic or toy datasets are created by researchers in order to establish a good starting point and benchmark framework to try, test or develop better algorithms that make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions in causal inference. The most common one is where the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases, a policy risk function π is designed to apply or not the treatment t depending on a certain threshold θ. Minimizing the errors when predicting the application of the active treatment or control is the main goal when iterating over different values of the threshold for the trained dataset.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and attendance at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group received only the pediatric follow-up.

However, (Hill, 2011) presented a semi-synthetic (also called semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned above. In (Hill, 2011), some continuous and binary covariates from this real-life RCT were selected; making use of these covariates, the author created simulated, non-parametrically generated outcomes for the whole trial population. In total, 25 covariates of the whole study were taken for the dataset creation. The author then introduced an artificial imbalance between the control and treated individuals by removing a subset of the treated population. Finally, the dataset comprises 747 subjects (units or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatment (t = 0 or t = 1) given the generalization that an algorithm must perform.

Along with the covariates, the simulated causal information can be observed for each unit: the effectively applied treatment (t = 0 or t = 1), the observed outcome (y^F), the counterfactual outcome (y^CF), and the average outcomes with noise, mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE and PEHE errors, for the reader to look into further if interested. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit and Sontag, 2016b), the IHDP dataset (Hill, 2011) was run on 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016) while selecting the log-linear response surface implemented as setting "B" in the mentioned tool. These results are not shown in this dissertation, since the response surface chosen differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson and Sontag, 2017). In (Johansson, Shalit and Sontag, 2016b), the BART results were based on Bayesian Additive Regression Trees (Chipman, George and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

In a recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks), the experiments were also run on the IHDP dataset (Hill, 2011). However, the authors do not state the number of replications used to gather the metrics, nor whether the log-linear surface "A" or "B" or any other method was used to simulate the semi-synthetic dataset. Consequently, the results they obtained cannot be compared to this dissertation's results and are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset was performed to ultimately predict all the factual (y^F) and counterfactual (y^CF) outcomes for every single unit. Subsequently, those values are input into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. In this work, correctly predicting the Individual Treatment Effect (ITE) is of particular interest, since it identifies the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome y^CF nor the average treatment outcomes with noise (mu0, mu1) can be used at all to train the regression models. Instead, these three values, along with the factual outcome y^F, are used only to obtain the εITE, εATE and εPEHE errors.

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The 10, 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017), from which the tables with their state-of-the-art errors will also be displayed in this section, so the reader can compare with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definition given by (Shalit, Johansson and Sontag, 2017) in their publication is used, the same methodology later followed by (Louizos et al., 2017) to perform, compare and show their results.

Within-sample: this test refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already trained model against the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples in which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for an individual x ∈ X. The other problem to overcome is that, in practice, the population which received treatment t = 1 and the population which received t = 0 might come from completely different probability distributions. All these are common problems of observational data, and they were mentioned in the Background chapter.


Out-of-sample: these predictions are made on completely new units, unseen during the training or validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0; consequently, predictions are then made setting all the values of the treatment to t = 1. The subtraction of these two predictions for each input is known as the ITE, and will ultimately define whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by E[Y_1 − Y_0 | x].

The machine learning algorithms, implemented in Python with (User guide contents — scikit-learn 0.19.2 documentation), were run with the default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. The hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
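The per-replication evaluation loop can be sketched as follows (synthetic stand-in replications and a Ridge regressor are used here purely for illustration; the real experiments use the downloaded IHDP files and the full set of regressors):

```python
import numpy as np
from sklearn.linear_model import Ridge

def replication_errors(X, t, yf, ycf):
    """Train on factual data only; score against the (synthetic) counterfactuals."""
    n = len(t)
    model = Ridge(alpha=1.0).fit(np.column_stack([X, t]), yf)
    y1_hat = model.predict(np.column_stack([X, np.ones(n)]))
    y0_hat = model.predict(np.column_stack([X, np.zeros(n)]))
    ite_hat = y1_hat - y0_hat
    y1_true = np.where(t == 1, yf, ycf)      # recover true potential outcomes
    y0_true = np.where(t == 0, yf, ycf)
    ite_true = y1_true - y0_true
    pehe = np.sqrt(np.mean((ite_hat - ite_true) ** 2))
    ate_err = abs(ite_hat.mean() - ite_true.mean())
    return pehe, ate_err

rng = np.random.RandomState(0)
pehes, ate_errs = [], []
for _ in range(10):                          # toy stand-in for 10 replications
    X = rng.normal(size=(100, 5))
    t = rng.binomial(1, 0.3, size=100)       # unbalanced treatment assignment
    y1 = X[:, 0] + 2.0 + rng.normal(scale=0.1, size=100)
    y0 = X[:, 0] + rng.normal(scale=0.1, size=100)
    yf, ycf = np.where(t == 1, y1, y0), np.where(t == 1, y0, y1)
    p, a = replication_errors(X, t, yf, ycf)
    pehes.append(p)
    ate_errs.append(a)

mean_pehe, std_pehe = np.mean(pehes), np.std(pehes)  # averaged across replications
```

Averaging each error over the replications, and reporting its deviation, is what produces the "mean ± deviation" entries in the tables below.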

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, for within-sample and out-of-sample respectively. In this case it is remarkable that the split between training and testing was performed only over the training dataset, randomly. The intention was to test whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, the hyperparameter tuning, if any, was performed on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both the within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting "A" generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four different tables are shown for the 1000 replications, in pairs: two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results of the methods presented in the following section.
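The scaling step can be sketched as follows; variable names are illustrative, and the scaler is fit on the training covariates only and then reused on the held-out units.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 25))   # stand-in for the IHDP covariates
X_test = rng.normal(size=(30, 25))

# Fit the [0, 1] scaler on the training covariates only, then apply the
# same transform to the test units (which may therefore fall slightly
# outside [0, 1]) to avoid information leakage.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
```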



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is well below that of the regressors, the main reason being that when encoding the continuous target values into classes, these are not the same values that need to be predicted; furthermore, precision is lost when decoding the predictions. The l2 norm was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
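A sketch of this encode-predict-decode approach, under the assumption that the continuous outcome was discretized into bins; the helper name and bin count are illustrative. With the lbfgs solver, scikit-learn handles the multi-class case multinomially.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_logistic_as_regressor(X, y, n_bins=20):
    """Discretize the continuous outcome into bins, fit an L2-penalized
    multinomial logistic regression, and decode predictions back to the
    bin centres -- the rounding is where precision is lost."""
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    classes = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)
    centres = (edges[:-1] + edges[1:]) / 2
    clf = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=2000)
    clf.fit(X, classes)
    return lambda X_new: centres[clf.predict(X_new)]

# Toy illustration of the round-trip.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + 0.1 * rng.normal(size=300)
predict = fit_logistic_as_regressor(X, y, n_bins=10)
y_hat = predict(X)
```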

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

                                       εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)    7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)        7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                       εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)    5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)        5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


From all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyperparameter tuning were done. The errors observed were even smaller, so the final hyperparameters selected for this dataset were a Radial Basis Function (rbf) kernel with C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyperparameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyperparameter selection.
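A sweep of this kind can be sketched as below; the toy data stands in for IHDP, and only the gamma axis of the grid is shown, with the C=1e3 RBF settings from the tables.

```python
import numpy as np
from sklearn.svm import SVR

# Toy stand-in data; the gamma values are the thesis's grid settings.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
X_tr, y_tr, X_va, y_va = X[:150], y[:150], X[150:], y[150:]

gammas = (0.1, 0.05, 0.01, 0.001, 1e-4, 1e-5)
scores = {}
for g in gammas:
    m = SVR(kernel="rbf", C=1e3, gamma=g).fit(X_tr, y_tr)
    scores[g] = np.mean((m.predict(X_va) - y_va) ** 2)   # validation MSE
best_gamma = min(scores, key=scores.get)
```

In the actual experiments the selection criterion would be the εITE / εATE / PEHE errors per setting rather than this toy validation MSE.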

TABLE 4.13: IHDP 100 replications, SVR hyperparameter tuning - Within-sample

                          εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyperparameter tuning - Out-of-sample

                          εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39

Finally, the final results obtained by this thesis's experiments are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .73 ± .04
OLS/LR-2    2.4 ± .1    .14 ± .01
BLR         5.8 ± .3    .72 ± .04
k-NN        2.1 ± .1    .14 ± .01
TMLE        5.0 ± .2    .30 ± .01
BART        2.1 ± .1    .23 ± .01
RANDFOR     4.2 ± .2    .73 ± .05
CAUSFOR     3.8 ± .2    .18 ± .01
BNN         2.2 ± .1    .37 ± .03
TARNET      .88 ± .0    .26 ± .01
CFR MMD     .73 ± .0    .30 ± .01
CFR WASS    .71 ± .0    .25 ± .01

Within-sample, IHDP 1000 replications.

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .94 ± .06
OLS/LR-2    2.5 ± .1    .31 ± .02
BLR         5.8 ± .3    .93 ± .05
k-NN        4.1 ± .2    .79 ± .05
BART        2.3 ± .1    .34 ± .02
RANDFOR     6.6 ± .3    .96 ± .06
CAUSFOR     3.8 ± .2    .40 ± .03
BNN         2.1 ± .1    .42 ± .03
TARNET      .95 ± .0    .28 ± .01
CFR MMD     .78 ± .0    .31 ± .01
CFR WASS    .76 ± .0    .27 ± .01

Out-of-sample, IHDP 1000 replications.


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications on powerful feature selection methods were presented in the Related Work section, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn library.

In addition, assuming Strong Ignorability on the studied dataset, it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; thus they are not shown, but they can be revised by the reader in the code implementation for further analysis.
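For reference, the RFE experiment follows the standard scikit-learn pattern; the estimator choice and the number of kept features here are illustrative only.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 25))          # stand-in covariates
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=150)

# Recursively drop the weakest features according to the linear model's
# coefficients, keeping the 10 highest-ranked covariates.
selector = RFE(LinearRegression(), n_features_to_select=10)
X_reduced = selector.fit_transform(X, y)
```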

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for the 10 replications; the code uploaded with this work contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated finishing time was about four and a half days on an Intel Dual Core i7.

Domain Adaptation algorithms are a promising field for exploring ITE and ATE predictions, due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

                         εITE           εATE           √εPEHE
DANN (Within-sample)     1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)     1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and out-of-sample, IHDP 10 replications.

4.3 Discussion

As can be clearly noticed, the results of the machine learning regressor algorithms applied in this dissertation are very close to those obtained in the works published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the treatment-unbalanced dataset, nor any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.


It seems that an excessive amount of effort by those authors leads to complicated methods without gaining much more in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or of the treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Network over 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes for a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state-of-the-art results.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is part of the must-do tasks when using machine learning algorithms) were performed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, performing predictions on both the factual and counterfactual outcomes to later present the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, but it shifted once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that there are machine learning techniques introduced in recent years that are potentially more suitable than both machine learning regressors and custom or generalized metric and error functions, such as Domain Adaptation Neural Networks, as well as other methods from the Deep Learning literature. Moreover, there are continuous-space problems of causality from observational data, involving more than two possible outcomes, that are substantially more suitable to solve with Reinforcement Learning algorithms than with any other Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow when relating terms from the causal inference field for researchers with a computer and data science background. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply these machine learning methods to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, extending the functionality developed for a binary treatment and a continuous outcome to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset supporting this kind of treatment set would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold determining whether a treatment should be applied or not, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due to both the architectural design of the algorithm and the precision, surpassing the state of the art, that it achieved in the experiment run.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied is framed within time-series problems in the continuous space. These kinds of datasets will possibly be the next focus of researchers on machine learning applied to treatments administered over time.


Bibliography

Alaa Ahmed M Michael Weisz and Mihaela Van Der Schaar (2017) Deep Counter-factual Networks with Propensity-Dropout Tech rep arXiv arXiv170605966v1URL httpsarxivorgpdf170605966pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf

32 Bibliography

Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie Vincent (2016) NPCI Non-parametrics for Causal Inference URL https githubcomvdorienpci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer Patrik O et al Nonlinear causal discovery with additive noise models Tech repURL https is tuebingen mpg de fileadmin user _ upload files publicationsNIPS2008-Hoyer-neu_5406[0]pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww

Bibliography 33

Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos Christos et al (2017) ldquoCausal Effect Inference with Deep Latent-VariableModels arXiv 1705 08821v2 [ stat ML ] 6 Nov 2017rdquo In Nips arXiv arXiv170508821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1

34 Bibliography

Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Complete Draft Tech rep URLhttpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.
Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: arXiv:1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf
Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep

  • Declaration of Authorship
  • Abstract
  • Acknowledgements
  • Introduction
    • Motivation
    • Purpose and Research Question
    • Approach and Methodology
    • Scope and Limitation
  • Background
    • Rubin-Neyman Causal Model
      • The fundamental problem of causal analysis
      • Metrics for Causality
      • Assumptions
      • Definitions
    • Related Work
    • Machine Learning
      • Ordinary Least Squares (Linear Regression)
      • Ridge Regression
      • Support Vector Regressor
      • Bayesian Ridge
      • Lasso
      • Lasso Lars
      • ARD Regression
      • Passive Aggressive Regressor
      • Theil Sen Regressor
      • K-Neighbors Regressor
      • Logistic Regression
  • Methodology
    • Dataset
    • IHDP dataset
    • Other articles metrics
  • Experiments
    • Machine learning methods applied to IHDP dataset
    • Other experiments
      • Recursive Feature Elimination
      • Domain Adaptation Neural Networks
  • Discussion
  • Conclusions
    • Concluding Remarks
    • Future work
  • Bibliography

1.3 Approach and Methodology

For causality analysis purposes, the Rubin Potential Outcomes Framework (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions) and its notation will be used throughout this thesis. This model is also known as the Neyman-Rubin causal model, and it is an approach to the statistical analysis of cause and effect based on the potential outcomes framework.

For the machine learning experiments, the latest version of the scikit-learn Python framework (User guide: contents — scikit-learn 0.19.2 documentation) has been used, with all its underlying methods and default hyper-parameters. The mathematical notation of its documentation will also be used to describe each algorithm's functions and limitations.

The different algorithms were tested on a standard benchmark dataset for causal inference from observational data: the Infant Health and Development Program (IHDP), introduced by (Hill, 2011) as a semi-synthetic dataset based on real features obtained from an observational study (Gross, 1993). Replications of this dataset were created in sets of 10, 100, and 1000, in order to train the machine learning models, predict with them, and obtain the desired error metrics afterwards.

It is important to notice that the testing method and the metrics used to determine the effectiveness of each algorithm differ from the ones normally used to evaluate machine learning algorithms; testing these algorithms within a causal framework differs substantially from the usual train/test paradigm of the machine learning field.

Since a semi-synthetic dataset has been used, the real Individual Treatment Effect is available for computing test metrics. Therefore, the experiment results will consist of the performance of each algorithm in terms of the Individual Treatment Effect (ITE), the Average Treatment Effect (ATE), and the Precision in Estimation of Heterogeneous Effect (PEHE). These are the three metrics displayed in the Experiments chapter for each trained algorithm. A detailed explanation of these formulas can be found in the next sections.

Also, it is important to notice that the machine learning algorithms are trained using just the applied (observed) treatment, the features (covariates, in the causal inference literature), and the observed outcome (usually known as Y or Y factual). After training, a completely unseen dataset is used for testing purposes.

The already trained algorithm then predicts, based only on the unit's (also known as the patient's) features (covariates), the outcome for the case in which the unit would have taken the treatment, and likewise the outcome as if the unit had taken the control treatment. Once both outcomes are predicted (Y factual and Y counterfactual), the ATE, ITE, and PEHE metrics are calculated. In addition, an average score and its deviation over each run of the 10, 100, and 1000 replications of the IHDP dataset were computed to evaluate these errors in bigger simulated scenarios.
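The train-and-toggle flow described above can be sketched as follows. This is a minimal illustration on synthetic data: all variable names and the generated dataset are hypothetical, and any scikit-learn regressor can stand in for the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 500, 25
X = rng.normal(size=(n, p))                    # covariates (features)
t = rng.integers(0, 2, size=n)                 # observed binary treatment
y_f = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)  # factual outcome

# Train only on the covariates, the applied treatment, and the factual outcome.
model = LinearRegression().fit(np.column_stack([X, t]), y_f)

# Predict both potential outcomes by toggling the treatment indicator.
y1_hat = model.predict(np.column_stack([X, np.ones(n)]))
y0_hat = model.predict(np.column_stack([X, np.zeros(n)]))

ite_hat = y1_hat - y0_hat     # estimated Individual Treatment Effects
ate_hat = ite_hat.mean()      # estimated Average Treatment Effect
```

On real data the counterfactual outcome is never used for fitting; here it is never even generated, which mirrors the training restriction described above.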

Mathematical notation will be kept to a minimum so as not to confuse the reader with unnecessary information.


1.4 Scope and Limitation

In this document, the outcomes of the treatment applied to a patient will be analyzed only with respect to two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments nor in the developed code, but both can easily be extended to cover these cases.

A binary treatment is applied, but its outcome value is continuous, in contrast to the most commonly used case of four possible scenarios, in which usually just two can be observed or measured. All the experiments and the code developed can be applied to discrete outputs, but other machine learning techniques could be more suitable for this type of prediction (classification algorithms). Also, there are cases in which the task is to predict whether or not to apply the treatment to an individual (also known as a unit or patient). Prediction in these cases turns into a classification task, in which a threshold on the confidence of predicting to affirmatively apply the treatment is set and validated, through trial and error against several continuous values, to determine which one predicts with the best accuracy. This is called Policy Risk in the causal inference literature.

This case is more similar to real-world scenarios, where the data was observed and finally a decision on applying the treatment (action) or not has to be made in order to pursue a desired result.

In this work, the cases in which the dataset contains outcomes in binary form, for predicting whether or not to apply the treatment, will not be covered.


Chapter 2

Background

2.0.1 Rubin-Neyman Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Rubin-Neyman Potential Outcomes framework, is an extended statistical analysis frame for modeling observational data, developed by Donald Rubin. He built the framework on top of the original method that Jerzy Neyman developed in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Rubin-Neyman potential outcomes framework considers units

x_i ∈ X

with an effectively applied treatment

t_i ∈ {0, 1}.

The two possible potential outcomes are defined by

Y0(x_i), Y1(x_i) ∈ Y.

Of one of them (the one which actually happened) we can observe the factual outcome

y_i^F = t_i Y1(x_i) + (1 − t_i) Y0(x_i).

Let (x_1, t_1, y_1^F), …, (x_n, t_n, y_n^F) be a sample from the factual distribution, and consequently let (x_1, 1 − t_1, y_1^CF), …, (x_n, 1 − t_n, y_n^CF) be the counterfactual sample.

Notice that all the factual outcomes y^F are known, whereas the counterfactual outcomes y^CF are never known for any unit (except during the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions y^F and y_ft will be used interchangeably for factual (observed) outcomes, while y^CF and y_cft will denote counterfactual outcomes.


2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x assigned either treatment t = 1 or t = 0, it is impossible to observe the counterfactual outcome E[Y0 | x, t = 1] or E[Y1 | x, t = 0] (what would have happened, or what the outcome would have been, had the other treatment been given to the unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y0 | x, t = 0] or E[Y1 | x, t = 1], or, in shorter terms, Y0 or Y1.

In this dissertation the focus is on the case when the causal graph is simple and known to be of the form (Y1, Y0) ← x → t, with no hidden confounders.

This problem can be extended, as can most of the problems and applications discussed under the Rubin-Neyman Potential Outcomes Framework in this dissertation, to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome y_cf is even worse in multi-treatment experiments, since the number of missing values that matter for better Individual Treatment Effect estimates grows with the total number of possible treatments, minus the one applied.

2.0.3 Metrics for Causality

Three well-known metrics in the causality field are reported for each machine learning technique applied.

The losses that will be reported are

• ε_ITE: Error of the Individual Treatment Effect, also known as the Conditional Average Treatment Effect (CATE); it measures how well or poorly the treatment performs on one particular patient.

ITE(x) := E[y | X = x, t = 1] − E[y | X = x, t = 0] = E[Y_x1 − Y_x0]

• ε_ATE: Error of the Average Treatment Effect; as its name describes, it represents the effect that the applied treatment, either t = 0 or t = 1, effectively had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, since the patient's unique characteristics as a unit might make them experience wrong results or no results at all.

ATE := E[ITE(x)] = E[δ] = E[Y1 − Y0], ∀x ∈ X

• ε_PEHE: Precision in Estimation of Heterogeneous Effect, used to measure the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or less accurate, for the other.

PEHE := (1/N) Σ_{i=1}^{N} ((y_i1 − y_i0) − (ŷ_i1 − ŷ_i0))²

where ŷ_i1 and ŷ_i0 denote the predicted potential outcomes.
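Because the benchmark dataset is semi-synthetic, both true potential outcomes are available at test time, so these errors can be computed directly. A hedged NumPy sketch (array names are hypothetical; `y1_hat`/`y0_hat` stand for predicted potential outcomes):

```python
import numpy as np

def eps_ate(y1, y0, y1_hat, y0_hat):
    # absolute difference between true and estimated Average Treatment Effect
    return abs((y1 - y0).mean() - (y1_hat - y0_hat).mean())

def eps_pehe(y1, y0, y1_hat, y0_hat):
    # mean squared error between true and estimated individual effects
    return np.mean(((y1 - y0) - (y1_hat - y0_hat)) ** 2)

# toy example: three units with true individual effects [2, 1, 3]
y1, y0 = np.array([3.0, 2.0, 4.0]), np.array([1.0, 1.0, 1.0])
y1_hat, y0_hat = np.array([2.5, 2.5, 4.5]), np.array([1.0, 1.0, 1.0])
```

On real observational data these quantities cannot be computed, which is precisely why semi-synthetic benchmarks such as IHDP are used.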


2.0.4 Assumptions

To work on the results, three important assumptions under the Rubin-Neyman causal framework shall be made:

• Consistency: For each unit, just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y0 will be the observed (factual) outcome y^F; if instead the applied treatment was t = 1, then y = Y1 will be the available observed (factual) outcome y^F.

• Strong Ignorability: Also known as no unmeasured confounders, this assumption can be stated as (Y1, Y0) ⫫ t | x and 0 < p(t = 1 | x) < 1, ∀x. It is important to notice that, to be able to state this assumption, a domain-knowledge expert would have to assess the dataset and thereby determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: This assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 | x) < 1

2.0.5 Definitions

In causal inference from observational data, several terms are used interchangeably and might confuse the reader.

This subsection should be clear before going further into this dissertation

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input. Notation: x_i, x_i ∈ X.

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML). Notation: x, x ∈ X.

• treatment: the possible different actions that can be applied to a unit. Usually binary, but can be multi-valued under the Rubin-Neyman Potential Outcomes Framework. Synonym: action. Notation: t, t ∈ {0, 1} or t ∈ {0, …, N}.

• outcome: the measured result of applying a treatment t to a unit x. Synonyms: observed outcome, result, factual, Y factual. Notation: y_f = y^F.

• counterfactual: what the result would have been if the opposite of the effectively applied treatment had been applied to a unit. Synonym: unobserved outcome. Notation: y_cft, y_cf, Y^CF.

2.0.6 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin, 1978).


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology, and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning the attention was caught not less than a decade ago (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric, and doubly robust methods.

Parametric methods model the relationships between features, action pairs, and rewards by implementing one or more parameters, trying to specifically model the relations between context, outcomes, and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015), and regression trees (Chipman, George, and McCulloch, 2010) have been used in the past to complete the task. For example, (Wager and Athey, 2017) estimate ITEs with causal forests, but their asymptotic estimates in datasets with a large number of relevant features have limitations that need to be addressed in future work.

In non-parametric approaches, the counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causality is performed by merging parametric and non-parametric methods (Dudik, Langford, and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both method families. A common example of this approach is propensity-score-weighted regression (Bang and Robins, 2005; Dudik, Langford, and Li, 2011). When the treatment assignment probability is known, this method models the problem particularly well, e.g. in off-policy evaluation or learning from bandits; however, in most observational data settings its efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions, as well as the application of other techniques to causality, with special focus on datasets with unbalanced treatment application. This refers to the sub-area of causality known as causal inference from observational data. Observational data is data that has been, or is, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting advances to the scientific community (Shalit, Johansson, and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit, and Sontag, 2016a; Shalit, Johansson, and Sontag, 2017; Alaa, Weisz, and Van Der Schaar, 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization, the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that reduces the risk of predicting the action to the minimum.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few, (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016), and (Shalit, Johansson, and Sontag, 2017). (Johansson, Shalit, and Sontag, 2016a) and (Shalit, Johansson, and Sontag, 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error loss between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson, and Sontag, 2017) the authors built on (Johansson, Shalit, and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work was performed by implementing Gaussian processes (Alaa, Weisz, and Van Der Schaar, 2017) and decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) address the problem of learning from biased data with several features by performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning the features that are relevant for predicting some actions while not taking them into account for others. The relevant-feature-selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), likewise in (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and they do not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims, 2015a) proposed a Counterfactual Risk Minimization (CRM) method, in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame, and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald, and Pereira, 2006). Additional techniques in policy optimization were developed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame, and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work in the DA techniques field was done by (Zhang et al., 2013; Daumé, 2009).


To conclude, in cause-and-effect analysis, time series data is widely adopted for decision-making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time steps. An estimator on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time-series causality task. Later, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that was set beforehand (Dudik, Langford, and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas, and Brunskill, 2017).

2.1 Machine Learning

In this section, the machine learning techniques applied to perform the experiments, implemented with the scikit-learn open-source framework, will be described.

The vast majority of the tested methods belong to the family of Generalized Linear Models, in which the target (label) value is represented as a linear combination of the covariates (inputs):

y(w, x) = w0 + w1 x1 + … + wp xp    (2.1)

where the vector w = (w1, …, wp) represents the coefficients and w0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it.

Mathematically it solves the problem of

min_w ||Xw − y||₂²

The main limitation of this method is that, if the features (covariates) have an approximate linear dependence, the model produces high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. It can be noticed that the loss turns into a penalized sum-of-squares minimization problem:

min_w ||Xw − y||₂² + α||w||₂²

It is worth mentioning that the parameter α ≥ 0 controls the amount of robustness to collinearity that the trained model is going to have.
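As an illustration of the role of α, a small sketch on synthetic, nearly collinear data (all names and values hypothetical) showing that a larger alpha in scikit-learn's Ridge shrinks the coefficient norm:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
# second column is an almost exact copy of the first (approximate collinearity)
X = np.column_stack([x1, x1 + rng.normal(scale=1e-4, size=200)])
y = x1 + rng.normal(scale=0.1, size=200)

w_small_alpha = Ridge(alpha=0.01).fit(X, y).coef_
w_large_alpha = Ridge(alpha=100.0).fit(X, y).coef_
```

With the small penalty, the near-collinearity can let the two coefficients blow up in opposite directions; the larger α keeps them small and stable, which is the robustness described above.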

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely spread Support Vector Machines for classification, adapted to solve regression problems. During the training phase, the best possible solution is the one that is least penalized in total by a loss function. The support vectors will be the inputs that are either misclassified, classified within just enough margin, or on the edge of the generated hyper-plane that splits the dataset for future predictions.

In particular, an SVR takes training vectors x_i ∈ R^p, i = 1, …, n, and a vector y ∈ R^n; ε-SVR solves the following primal problem:

min_{w,b,ζ,ζ*} (1/2) wᵀw + C Σ_{i=1}^{n} (ζ_i + ζ_i*)

subject to y_i − wᵀφ(x_i) − b ≤ ε + ζ_i, wᵀφ(x_i) + b − y_i ≤ ε + ζ_i*, and ζ_i, ζ_i* ≥ 0, for i = 1, …, n.

In the corresponding dual problem, e is the vector of all ones, C > 0 is the upper bound, and Q is an n × n positive semidefinite matrix with Q_ij ≡ K(x_i, x_j) = φ(x_i)ᵀφ(x_j), where K is the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function φ.

The decision function is:

Σ_{i=1}^{n} (α_i − α_i*) K(x_i, x) + ρ
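A brief usage sketch of scikit-learn's SVR on toy one-dimensional data (the parameter values are illustrative only):

```python
import numpy as np
from sklearn.svm import SVR

X = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
y = 2.0 * X.ravel()                       # noiseless linear target

# epsilon defines the tube within which no penalty is incurred;
# C trades off flatness of the model against tolerated deviations
svr = SVR(kernel="linear", C=10.0, epsilon=0.01).fit(X, y)
pred = svr.predict([[0.5]])[0]            # should be close to 1.0
```

With a linear kernel and a noiseless linear target, the fitted function stays within the ε-tube around the true line.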

2.1.4 Bayesian Ridge

Compared to ordinary Linear Regression, Bayesian Ridge Regression is more robust to ill-posed problems.

This technique builds a probabilistic model of the regression problem, in which the parameter w of the general Bayesian Regression solver is given a spherical Gaussian prior:

p(w | λ) = N(w | 0, λ⁻¹ I_p)

The scikit-learn defaults are used to train the model: α_1 = α_2 = λ_1 = λ_2 = 10⁻⁶.

During the model-fitting process, the parameters w, α, and λ are estimated jointly.


2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

min_w (1 / (2 n_samples)) ||Xw − y||₂² + α||w||₁

This method minimizes the least-squares penalty with α||w||₁ added, where α is a constant and ||w||₁ is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm retrieves sparse models, which may be helpful for performing feature selection.
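A small sketch of that sparsity, on synthetic data where only the first of ten features matters (the α value and data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 10))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=1000)  # only feature 0 is relevant

lasso = Lasso(alpha=0.5).fit(X, y)
n_nonzero = int(np.sum(lasso.coef_ != 0))   # coefficients surviving the l1 penalty
```

The ℓ1 penalty should drive the nine irrelevant coefficients to exactly zero, which is what makes the fitted model usable for feature selection.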

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with ℓ1 regularization applied.

The objective function is determined by

(1 / (2 n_samples)) ||y − Xw||₂² + α||w||₁

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w. It drops the assumption that the Gaussian is spherical, making it elliptical instead.

Mathematically:

p(w | λ) = N(w | 0, A⁻¹)

with diag(A) = λ = {λ_1, …, λ_p}.

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, these methods do not require a learning rate, but they do require a regularization parameter C.

They can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (squared epsilon-insensitive).

2.1.9 Theil Sen Regressor

It is especially suited to multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems. In that regime, this method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm, the target is predicted from the n nearest neighbors found during the training phase. It is important to notice that n is defined by the user, and it will affect, positively or negatively, the obtained prediction results.
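A minimal sketch on toy one-dimensional data showing how the chosen n drives the prediction, which (with the default uniform weights) is simply the average of the n nearest training targets:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
# the two nearest training points to 1.6 are x=2.0 and x=1.0,
# so the prediction is the mean of their targets: (2.0 + 1.0) / 2 = 1.5
pred = knn.predict([[1.6]])[0]
```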

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can produce predictions over more than one class using the log function.

The scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets, and the results will be discussed in the Experiments section.


Chapter 3

Methodology

In this chapter, what this work attempts to achieve will be detailed. In addition, the methods that were used are explained, as well as any other information necessary to help the reader understand the flow of the experiments covered later.

In addition, the dataset that was used will be presented, closing with a section about other possible datasets that could be applied and possible limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments will be covered in detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of the recent efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients, or inputs (in the machine learning vocabulary) that was collected without the chance of previously designing a carefully planned Randomized Controlled Trial. By its nature, the already collected (observational) data has not been randomized properly, nor does it come from the same probability distribution. Also, the number of units which received the treatment versus the number which did not could potentially differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions, since they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment, which in this case would be driving without alcohol consumption, is completely unethical, for clear reasons.

To overcome these limitations when working on causal effects from observational data, synthetic, semi-synthetic, or toy datasets are created by researchers in order to establish a good starting point and a benchmark framework to try, test, or develop better algorithms that are able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions for causal inference. One is the most common to obtain, in which the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases, a policy risk function π is designed to apply the treatment t or not, depending on a certain threshold θ. Making as few errors as possible when predicting the application of the active treatment or control is the main goal when iterating over different values of the threshold variable for the trained dataset.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment conditions to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and attendance at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

However, (Hill, 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the paragraph above. In (Hill, 2011), some continuous and binary covariates from this real-life RCT were selected; making use of them, the author created a simulated outcome and generated non-parametric simulated outcomes for the whole population of the trial. In total, 25 covariates of the whole study were taken for this dataset creation. Consequently, the author introduced an artificial imbalance between the control and treatment individuals by removing a subset of the treated population. Finally, the dataset comprises 747 subjects (units or inputs), of which 608 were not given the treatment (control) and 139 were treated. As can clearly be noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 or t = 1 based on the generalization an algorithm can perform.

Along with the covariates, the simulated causal information can be observed for each unit: the effectively applied treatment (t = 0 or t = 1), the observed outcome (y_ft), the counterfactual outcome (y_cft), and the average outcomes with noise, mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.
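A hedged loading sketch: the replication files are assumed here to be NumPy .npz archives whose arrays ('x', 't', 'yf', 'ycf', 'mu0', 'mu1') stack replications along the last axis. The key names and shapes are assumptions about the downloaded files and should be checked against the actual archives; the demo below builds a dummy archive with that layout rather than using the real download.

```python
import os
import tempfile
import numpy as np

def load_replication(path, rep):
    """Slice one replication out of an .npz archive (assumed layout)."""
    data = np.load(path)
    x = data["x"][:, :, rep]        # covariates, shape (units, 25)
    t = data["t"][:, rep]           # applied treatment
    y_f = data["yf"][:, rep]        # factual (observed) outcome
    y_cf = data["ycf"][:, rep]      # counterfactual outcome (testing only)
    mu0, mu1 = data["mu0"][:, rep], data["mu1"][:, rep]  # noiseless means
    return x, t, y_f, y_cf, mu0, mu1

# dummy archive mimicking the assumed layout (747 units, 25 covariates, 10 reps)
tmp = os.path.join(tempfile.gettempdir(), "ihdp_dummy.npz")
np.savez(tmp, x=np.zeros((747, 25, 10)), t=np.zeros((747, 10)),
         yf=np.zeros((747, 10)), ycf=np.zeros((747, 10)),
         mu0=np.zeros((747, 10)), mu1=np.zeros((747, 10)))
x, t, y_f, y_cf, mu0, mu1 = load_replication(tmp, rep=0)
```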

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE, and PEHE errors, for the reader to look into further if interested. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b), the IHDP dataset (Hill, 2011) was run on 100 replication experiments for hyperparameter tuning and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation, since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). In (Johansson, Shalit, and Sontag, 2016b), the BART results were based on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

In a recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks), the authors also ran their experiments on the IHDP dataset (Hill, 2011). However, they do not state the number of replications used to gather the metrics, nor whether log-linear response surface A or B, or any other method, was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with the results of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset were performed to ultimately predict the factual (yF) and counterfactual (yCF) outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. In this work it is of particular interest to correctly predict the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as should be clear by this point, neither the counterfactual outcome yCF nor the noisy average outcomes mu0, mu1 can be used at all to train the regressor models. Instead, these values, along with the factual outcome yF, are only used to compute the εITE, εATE and εPEHE errors.
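Concretely, the evaluation step can be sketched as below. This is a minimal reading of the metrics, assuming `mu0`/`mu1` hold the ground-truth expected outcomes of one replication and `y0_pred`/`y1_pred` the model's two predictions per unit; the names are illustrative, not those of the cited code, and εITE is computed analogously in (Louizos et al., 2017)'s evaluation script.

```python
import numpy as np

def evaluation_errors(y0_pred, y1_pred, mu0, mu1):
    """ATE and PEHE errors; the ground truth mu0/mu1 is available only
    because IHDP is semi-synthetic, and is never seen during training."""
    ite_true = mu1 - mu0                      # true per-unit effect
    ite_pred = y1_pred - y0_pred              # estimated per-unit effect
    eps_ate = np.abs(np.mean(ite_true) - np.mean(ite_pred))
    eps_pehe = np.sqrt(np.mean((ite_true - ite_pred) ** 2))
    return eps_ate, eps_pehe
```

Perfect predictions (y0_pred = mu0, y1_pred = mu1) would drive both errors to zero.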

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The 10, 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with state-of-the-art errors are also displayed in this section, so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are the ones given by (Shalit, Johansson, and Sontag, 2017), the same technique later followed by (Louizos et al., 2017) to perform, compare and show their results.

Within-sample: this refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (with different numbers of samples for which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and its factual outcome are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population who received treatment t = 1 and the population who received t = 0 might come from completely different probability distributions. All of these are common problems of observational data, and they were mentioned in earlier chapters.


Out-of-sample: these predictions are made on completely unseen units, held out of the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even further from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for every single unit (input) of the testing dataset, to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts the outcome for each input (unit) setting the treatment value to t = 0, and then again setting the treatment value to t = 1. The difference between these two predictions for each input is known as the ITE, and it ultimately defines whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by E[Y1 − Y0 | x].

The machine learning algorithms, implemented in Python with (User guide contents - scikit-learn 0.19.2 documentation), were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. The hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
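The procedure just described can be sketched as follows with scikit-learn. This is a minimal illustration, assuming `X`, `t` and `y_f` are the covariate matrix, treatment vector and factual outcomes of one replication; the function and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def predict_both_arms(model, X, t, y_f):
    """Fit on factual data only (covariates plus the observed treatment),
    then predict every unit's outcome under t=0 and under t=1."""
    Xt = np.hstack([X, t.reshape(-1, 1)])           # treatment as an extra feature
    model.fit(Xt, y_f)
    y0 = model.predict(np.hstack([X, np.zeros((len(X), 1))]))
    y1 = model.predict(np.hstack([X, np.ones((len(X), 1))]))
    return y0, y1, y1 - y0                          # per-unit ITE estimate

# Any scikit-learn regressor with default hyperparameters fits this pattern:
# y0, y1, ite = predict_both_arms(LinearRegression(), X, t, y_f)
```

The same trained model serves both arms; only the appended treatment column changes between the two prediction passes.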

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14   0.94 ± 0.35   2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99   0.97 ± 0.67   4.80 ± 2.80
LassoLars                         4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
Lasso                             4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
ARDRegression                     3.92 ± 2.01   0.97 ± 0.74   4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09   1.54 ± 1.07   4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99   0.89 ± 0.63   4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
LinearRegression                  3.92 ± 2.01   0.89 ± 0.65   4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show results for 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case, note that the split between training and testing was performed randomly, over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63   1.24 ± 0.62   2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67   1.82 ± 1.35   4.13 ± 2.23
LassoLars                         4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
Lasso                             4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
ARDRegression                     3.57 ± 1.70   1.83 ± 1.41   4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94   2.38 ± 1.75   4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68   1.76 ± 1.30   4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
LinearRegression                  3.58 ± 1.70   1.77 ± 1.33   4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60   0.86 ± 0.16   5.65 ± 0.83
LassoLars                         4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
Lasso                             4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
ARDRegression                     4.49 ± 0.60   0.81 ± 0.16   5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75   0.83 ± 0.14   5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59   0.79 ± 0.15   5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
LinearRegression                  4.53 ± 0.60   0.79 ± 0.16   5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54   1.02 ± 0.26   5.37 ± 0.78
LassoLars                         4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
Lasso                             4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
ARDRegression                     4.27 ± 0.54   1.00 ± 0.26   5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69   1.00 ± 0.21   5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53   0.99 ± 0.25   5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
LinearRegression                  4.31 ± 0.55   0.99 ± 0.26   5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, the hyperparameter tuning, when performed, was done on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38   0.76 ± 0.08   3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58   0.80 ± 0.14   5.61 ± 0.83
LassoLars                         4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
Lasso                             4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
ARDRegression                     4.45 ± 0.59   0.77 ± 0.15   5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62   0.83 ± 0.13   5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57   0.72 ± 0.13   5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
LinearRegression                  4.48 ± 0.59   0.75 ± 0.14   5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28   0.73 ± 0.07   3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57   0.81 ± 0.11   5.74 ± 0.89
LassoLars                         4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
Lasso                             4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
ARDRegression                     4.42 ± 0.58   0.78 ± 0.11   5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59   1.01 ± 0.17   5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56   0.85 ± 0.13   5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
LinearRegression                  4.45 ± 0.58   0.78 ± 0.11   5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with log-linear response setting B generated using the code from (Dorie, 2016), was used for both types of measures.

Four different tables are presented for the 1000 replications, in pairs: two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not dramatically, but enough to keep the scaling for the final results presented in the following section.
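The scaling step can be reproduced as in the sketch below, wrapping the scaler and a regressor in a pipeline so the [0, 1] ranges are learned from the training covariates only; the regressor and data shown are illustrative, not the thesis's exact setup.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# MinMaxScaler maps each feature to [0, 1] using the min/max seen in fit();
# the pipeline keeps test-set statistics from leaking into training.
model = make_pipeline(MinMaxScaler(), SVR())

X_train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0], [6.0, 40.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0])
model.fit(X_train, y_train)
preds = model.predict(X_train)
```

At prediction time the scaler reuses the training ranges, so out-of-sample units are transformed consistently.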



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12   0.69 ± 0.02   3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19   0.78 ± 0.04   5.81 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22   0.90 ± 0.05   5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18   0.70 ± 0.03   5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09   0.78 ± 0.03   3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19   0.98 ± 0.05   5.79 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22   1.13 ± 0.07   5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08   0.33 ± 0.02   2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19   0.73 ± 0.04   5.80 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22   1.02 ± 0.06   5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.69 ± 0.03   5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20   1.64 ± 0.09   6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18   0.67 ± 0.03   5.79 ± 0.25
ElasticNet                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is far below the regressors; the main reason is that the target values encoded to assign class probabilities are not the same values that need to be predicted, and precision is also lost when decoding the predictions. The l2 norm was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
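One plausible reading of the encode/decode scheme described above is to bin the continuous outcome, classify the bins, and decode predictions back to bin centres, which is exactly where precision is lost in both directions. The sketch below follows that reading; the helper name and bin count are assumptions, and with the newton-cg solver and more than two classes recent scikit-learn fits the multinomial loss by default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_predict_binned(X, y, n_bins=20):
    """Encode a continuous target as bin labels, classify, then decode
    the predicted labels back to bin centres."""
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    labels = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)
    centres = (edges[:-1] + edges[1:]) / 2.0
    # l2 penalty is the default; newton-cg handles the multinomial loss.
    clf = LogisticRegression(solver="newton-cg", max_iter=1000)
    clf.fit(X, labels)
    return centres[clf.predict(X)]          # decoded, quantized predictions
```

Even a perfect classifier can only return bin centres, so the decoded outcome carries an irreducible quantization error of up to half a bin width.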

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

                                      εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)   7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)       7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                      εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)   5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)       5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


From all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were done. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they used


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyper-parameter selection.
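The sweep in the tables can be sketched as a plain candidate grid; a hypothetical reconstruction, since the selection criterion was the εITE/εATE/εPEHE errors over the 100 replications rather than cross-validation, so the candidates are simply enumerated and scored externally.

```python
from sklearn.svm import SVR

def candidate_svrs():
    """SVR configurations matching the sweep above; each candidate is
    scored by its causal-error metrics over the 100 replications."""
    for gamma in (0.1, 0.05, 0.01, 0.001, 0.0001, 0.00001):
        yield f"rbf-C1e3-g{gamma}", SVR(kernel="rbf", C=1e3, gamma=gamma)
    for C in (1e10, 1e20, 1e30):
        yield f"rbf-C{C:.0e}-g0.1", SVR(kernel="rbf", C=C, gamma=0.1)
    for degree in (1, 2, 4):
        yield f"poly-C1e3-d{degree}", SVR(kernel="poly", C=1e3, degree=degree)
    yield "poly-C1e10-d2", SVR(kernel="poly", C=1e10, degree=2)

best = SVR(kernel="rbf", C=1e3, gamma=0.01)   # the setting finally selected
```

Note that the rbf rows with C=1e10, 1e20 and 1e30 produce identical errors in the tables, which suggests the fit is insensitive to C in that regime.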

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

                           εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

                           εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39

Finally, the results obtained by this thesis and the run experiments are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

             √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .73 ± .04
OLS/LR-2    2.4 ± .1    .14 ± .01
BLR         5.8 ± .3    .72 ± .04
k-NN        2.1 ± .1    .14 ± .01
TMLE        5.0 ± .2    .30 ± .01
BART        2.1 ± .1    .23 ± .01
RAND FOR    4.2 ± .2    .73 ± .05
CAUS FOR    3.8 ± .2    .18 ± .01
BNN         2.2 ± .1    .37 ± .03
TARNET      .88 ± .0    .26 ± .01
CFR MMD     .73 ± .0    .30 ± .01
CFR WASS    .71 ± .0    .25 ± .01

Within-sample, IHDP with 1000 replications.

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

             √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .94 ± .06
OLS/LR-2    2.5 ± .1    .31 ± .02
BLR         5.8 ± .3    .93 ± .05
k-NN        4.1 ± .2    .79 ± .05
BART        2.3 ± .1    .34 ± .02
RAND FOR    6.6 ± .3    .96 ± .06
CAUS FOR    3.8 ± .2    .40 ± .03
BNN         2.1 ± .1    .42 ± .03
TARNET      .95 ± .0    .28 ± .01
CFR MMD     .78 ± .0    .31 ± .01
CFR WASS    .76 ± .0    .27 ± .01

Out-of-sample, IHDP with 1000 replications.


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section presented publications implementing powerful feature selection methods, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn library.
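A sketch of what such an RFE run can look like with scikit-learn is shown below; the estimator and the number of retained features are assumptions, since RFE needs a ranking signal such as the `coef_` of a linear model.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

# RFE fits the estimator, drops the weakest-ranked features (by coef_),
# and repeats until n_features_to_select remain. An rbf kernel exposes
# no coef_, so a linear-kernel SVR is used for the ranking step.
selector = RFE(estimator=SVR(kernel="linear"), n_features_to_select=10, step=1)

# X_reduced = selector.fit_transform(X, y_f)   # keeps the 10 surviving covariates
```

After fitting, `selector.support_` marks which covariates survived, so the same subset can be applied consistently to the test split.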

In addition, assuming Strong Ignorability on the studied dataset, it would not strictly be appropriate to perform such an experiment; but given the nature of the machine learning regressors and their sensitivity to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse across all algorithms for the causal inference metrics detailed in this work; thus they are not shown, but they can be reviewed by the reader in the code implementation for further analysis.

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications; in this work, the uploaded code contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine. On CPU, the estimated finishing time was about four and a half days with an Intel Dual Core i7.

Domain Adaptation algorithms are a promising field for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

                        εITE          εATE          √εPEHE
DANN (Within-sample)    1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and out-of-sample, IHDP 10 replications.

4.3 Discussion

As can be clearly noticed, the machine learning regressor algorithms applied in this dissertation achieve results very close to those obtained in the works published by the cited and compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the unbalanced treated dataset) or any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.


It seems that an excessive amount of effort by those authors leads to complicated methods without gaining much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the Domain Adaptation Neural Network training and testing errors on the 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained applying machine learning regressors out-of-the-box, with significantly less or no added complexity, to predict the factual and counterfactual outcomes for a given unit are close to those obtained by more elaborate, custom techniques that constitute the state-of-the-art performance results.

It has to be taken into account that no custom metric functions or special preprocessing steps (except for scaling the features, which is one of the standard tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, performing predictions of both the factual and counterfactual outcomes to later present the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, although its scope shifted when the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in the last few years are potentially more suitable than both machine learning regressors and custom or generalized metric and error functions: Domain Adaptation Neural Networks, as well as other methods from the Deep Learning literature. Moreover, there are continuous-space causal inference problems on observational data, with more than two possible outcomes, that are substantially better suited to Reinforcement Learning algorithms than to any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerable gap in straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer science or data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include the following approaches that should be taken.

First, it would be important to apply the studied machine learning methods to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied or not, would be an important next step in causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the precision, surpassing the state of the art, that it showed in the experiment run.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied over time is framed within time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers in machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. "Causal Reasoning from Longitudinal Data". DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353-60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: https://www.ncbi.nlm.nih.gov/pubmed/27382149

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399-424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: https://www.ncbi.nlm.nih.gov/pubmed/21818162

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207-3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266-298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849-866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217-240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596-1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199-1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803

Johansson, Fredrik D. (2017 (accessed July 19, 2018)). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit and David Sontag. "Learning Representations for Counterfactual Inference". In: (). URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2 [stat.ML].

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.nature.com/articles/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". In: URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Complete Draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.

  • Declaration of Authorship
  • Abstract
  • Acknowledgements
  • Introduction
    • Motivation
    • Purpose and Research Question
    • Approach and Methodology
    • Scope and Limitation
  • Background
    • Rubin-Neyman Causal Model
      • The fundamental problem of causal analysis
      • Metrics for Causality
      • Assumptions
      • Definitions
      • Related Work
    • Machine Learning
      • Ordinary Least Squares (Linear Regression)
      • Ridge Regression
      • Support Vector Regressor
      • Bayesian Ridge
      • Lasso
      • Lasso Lars
      • ARD Regression
      • Passive Aggressive Regressor
      • Theil Sen Regressor
      • K-Neighbors Regressor
      • Logistic Regression
  • Methodology
    • Dataset
    • IHDP dataset
    • Other articles metrics
  • Experiments
    • Machine learning methods applied to IHDP dataset
    • Other experiments
      • Recursive Feature Elimination
      • Domain Adaptation Neural Networks
  • Discussion
  • Conclusions
    • Concluding Remarks
    • Future work
  • Bibliography

4 Chapter 1 Introduction

14 Scope and Limitation

In this document, the outcomes of the treatment applied to a patient will be analyzed only with respect to two possible actions (binary treatment). Multi-valued treatments are not covered in the experiments nor in the developed code, but both can easily be extended to cover those cases.

A binary treatment is applied, but its outcome value is continuous. This differs from the most commonly used setting of four possible scenarios, of which usually just two can be observed or measured. All the experiments and the code developed can be applied to discrete outputs, but other machine learning techniques (classification algorithms) could be more suitable for that type of prediction. There are also cases in which the task is to predict whether or not to apply the treatment to an individual (also known as a unit or patient). Prediction in these cases turns into a classification task, in which a threshold on the confidence of predicting to affirmatively apply the treatment is set and validated, through trial and error against several continuous values, to determine which one predicts with the best accuracy. This is called the policy risk in the causal inference literature.

This case is closer to real-world scenarios, where the data was observed and finally a decision on whether or not to apply the treatment (action) has to be made in order to pursue a desired result.

In this work, the cases in which the dataset contains outcomes in a binary form, used to predict whether or not to apply the treatment, will not be covered.


Chapter 2

Background

201 Rubin-Neyman Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Rubin-Neyman potential outcomes framework, is a statistical framework for modeling observational data developed by Donald Rubin. Rubin built it on top of the method Jerzy Neyman introduced in his 1923 master's thesis, extending it to non-randomized controlled trials (observational data).

The Rubin-Neyman potential outcomes framework consists of units

xi ∈ X

with an effectively applied treatment

ti ∈ {0, 1}

The two possible potential outcomes are defined by

Y0(xi), Y1(xi) ∈ Y

For one of them (the one which actually happened) we can observe its factual outcome:

yFi = tiY1(xi) + (1 − ti)Y0(xi)

And let (x1, t1, yF1), . . . , (xn, tn, yFn) be a sample from the factual distribution.

Consequently, let (x1, 1 − t1, yCF1), . . . , (xn, 1 − tn, yCFn) be the counterfactual sample.

Notice that all the factual outcomes yF are known, whereas the counterfactual outcomes yCF are never known for any unit (except during the testing phase, and only when the dataset is semi-synthetic or synthetic).

The expressions yF or y f t will be used interchangeably to refer to factual observed outcomes, while yCF or yc f t will denote counterfactual outcomes.
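As a concrete illustration, the definitions above can be written out in a few lines of NumPy on synthetic data (the outcome model below is made up purely for illustration; in real observational data y_cf would be unobservable):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))            # covariates x_i
t = rng.integers(0, 2, size=n)         # effectively applied treatment t_i in {0, 1}
Y0 = X.sum(axis=1)                     # hypothetical potential outcome under control
Y1 = Y0 + 4.0                          # hypothetical potential outcome under treatment

y_f = t * Y1 + (1 - t) * Y0            # factual outcome yF: the one we can observe
y_cf = (1 - t) * Y1 + t * Y0           # counterfactual yCF: hidden in real data
```

Each unit contributes exactly one of its two potential outcomes to y_f; the other lands in y_cf.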


202 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x and assigning either treatment t = 1 or t = 0 to that unit, it is impossible to observe the counterfactual outcome E[Y0∣x, t = 1] or E[Y1∣x, t = 0] (what would have happened, or what the outcome would have been, if the other treatment had been given to the unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y0∣x, t = 0] or E[Y1∣x, t = 1], or in shorter terms Y0 or Y1.

In this dissertation the focus is on the case where the causal graph is simple and known to be of the form (Y1, Y0) ← x → t, with no hidden confounders.

This problem, like most of the problems and applications discussed under the Rubin-Neyman potential outcomes framework in this dissertation, can be extended to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome yCF becomes even worse in multi-treatment experiments, since the number of missing values that matter for better Individual Treatment Effect estimates grows with the total number of possible treatments, minus the one applied.
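One simple way machine learning attacks this problem is the "two-model" approach: fit one regressor per treatment arm on the factual data only, then take the difference of their predictions as an ITE estimate. The sketch below is illustrative, using synthetic data with a known constant effect of 2 rather than any dataset from this work:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, p = 500, 5
X = rng.normal(size=(n, p))
t = rng.integers(0, 2, size=n)
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)   # true effect = 2

m0 = LinearRegression().fit(X[t == 0], y[t == 0])       # model for E[y | x, t = 0]
m1 = LinearRegression().fit(X[t == 1], y[t == 1])       # model for E[y | x, t = 1]
ite_hat = m1.predict(X) - m0.predict(X)                 # estimated ITE for every unit
```

Note that both models are trained only on factual outcomes, yet their predictions are queried for every unit, which is exactly how the missing counterfactuals are filled in.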

203 Metrics for Causality

Three well-known metrics from the causality field are reported for each machine learning technique applied.

The losses that will be reported are

bull εITE: Error of the Individual Treatment Effect - also known as the Conditional Average Treatment Effect (CATE) - which measures how well or poorly the treatment performs on one particular patient:

ITE(x) ∶= E[y∣X = x, t = 1] − E[y∣X = x, t = 0] = E[Yx1 − Yx0]

bull εATE: Error of the Average Treatment Effect; as its name describes, it represents the effect that the applied treatment, either t = 0 or t = 1, effectively had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, whose unique characteristics as a unit might make them experience wrong results or no results at all:

ATE ∶= E[ITE(x)] = E[δ] = E[Y1 − Y0], ∀x ∈ X

bull εPEHE: the Precision in Estimation of Heterogeneous Effect measures the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or less accurate, for the other:

PEHE(x) ∶= (1/N) ∑_{i=1}^{N} ((yi1 − yi0) − (ŷi1 − ŷi0))²

where ŷi1 and ŷi0 denote the estimated potential outcomes.
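These losses can be sketched directly from the formulas above; the function names and the toy arrays are illustrative, not taken from any package (εATE is computed here as the absolute difference of the averages, εPEHE as the mean squared difference between true and estimated ITEs):

```python
import numpy as np

def eps_ate(y1, y0, y1_hat, y0_hat):
    # absolute error between the true and the estimated average treatment effect
    return abs(np.mean(y1_hat - y0_hat) - np.mean(y1 - y0))

def eps_pehe(y1, y0, y1_hat, y0_hat):
    # mean squared error between true and estimated individual treatment effects
    return np.mean(((y1_hat - y0_hat) - (y1 - y0)) ** 2)

y1, y0 = np.array([3.0, 5.0]), np.array([1.0, 2.0])          # true potential outcomes
y1_hat, y0_hat = np.array([4.0, 6.0]), np.array([1.0, 2.0])  # a model's estimates
```

On semi-synthetic data such as IHDP, y1 and y0 are both available, which is what makes these metrics computable at all.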


204 Assumptions

To work on the results, three important assumptions under the Rubin-Neyman causal framework shall be made:

bull Consistency: for each unit, just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y0 will be the observed, or factual, outcome (yF); if the applied treatment was t = 1, then y = Y1 will be the available observed, or factual, outcome yF.

bull Strong Ignorability: also known as no unmeasured confounders, this assumption can be stated as (Y1, Y0) ⫫ t ∣ x and 0 < p(t = 1∣x) < 1, ∀x. It is important to notice that, to be able to state this assumption, a domain knowledge expert has to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

bull Common Support: this assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1∣x) < 1

205 Definitions

In causal inference from observational data, several terms are used interchangeably and might confuse the reader.

These terms should be clear before going further into this dissertation.

Some common synonyms are

bull unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input. Notation: xi, xi ∈ X

bull covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML). Notation: x, x ∈ X

bull treatment: the possible actions that can be applied to a unit; usually binary, but it can be multi-valued under the Rubin-Neyman potential outcomes framework. Synonym: action. Notation: t, t ∈ {0, 1} or t ∈ {0, . . . , N}

bull outcome: the measured result of applying a treatment t to a unit x. Synonyms: observed outcome, result, factual. Notation: Y, y f = yF

bull counterfactual: what the result would have been if the opposite of the effectively applied treatment had been applied to the unit. Synonym: unobserved outcome. Notation: yc f t, yc f, YCF

206 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin, 1978).


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning it has attracted attention only in the past decade (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods model the relationships between features, action pairs and rewards by means of one or more parameters, trying to specifically model the relations within context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015) and regression trees (Chipman, George and McCulloch, 2010) have been used in the past to complete the task. For example, (Wager and Athey, 2017) estimate ITEs with causal forests, but their asymptotic estimates have limitations in datasets with a large number of relevant features that need to be addressed in future work.

In non-parametric approaches, counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causality is achieved by merging parametric and non-parametric methods (Dudik, Langford and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both families of methods. A common example of this approach is propensity score weighted regression (Bang and Robins, 2005; Dudik, Langford and Li, 2011). When the treatment assignment probability is known, this method models the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational data settings, their efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions - as well as the application of other techniques to causality - with special focus on datasets with unbalanced treatment application. This refers to the subarea of causality known as causal inference from observational data. Observational data is data that has been, or is, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting


advances to the scientific community (Shalit, Johansson and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically, for the estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag, 2016a; Shalit, Johansson and Sontag, 2017; Alaa, Weisz and Van Der Schaar, 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that reduces the risk of predicting the wrong action to the minimum.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few: (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016) and (Shalit, Johansson and Sontag, 2017). (Johansson, Shalit and Sontag, 2016a) and (Shalit, Johansson and Sontag, 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error loss between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag, 2017) the authors built on (Johansson, Shalit and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work was performed with Gaussian processes (Alaa, Weisz and Van Der Schaar, 2017) and with decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with several features by performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning features that are relevant for predicting some actions while not taking them into account for others. The relevant feature selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), likewise in (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and they do not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims, 2015a) came up with a Counterfactual Risk Minimization (CRM) method in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, working closely to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald and Pereira, 2006). Additional techniques on policy optimization were contributed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection sampling. The algorithm that (Atan, Zame and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work in the DA techniques field was done by (Zhang et al., 2013; Daumé, 2009).


To conclude, in cause and effect analysis, time series data is widely adopted for decision making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time series causality task. Later, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that was set beforehand (Dudik, Langford and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas and Brunskill, 2017).

21 Machine Learning

In this section, the machine learning techniques applied to perform the experiments, using the scikit-learn open source framework, will be described.

The vast majority of the methods tested belong to the family of Generalized Linear Models; they represent the target (or label) value as a linear combination of the covariates (inputs):

y(w, x) = w0 + w1x1 + . . . + wpxp (2.1)

where the vector w = (w1, . . . , wp) represents the coefficients and w0 is the intercept.

211 Ordinary Least Squares (Linear Regression)

In this model, the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it.

Mathematically, it solves a problem of the form:

min_w ∣∣Xw − y∣∣²₂

The main limitation of this method is that, if the features (covariates) have an approximate linear dependence, the model produces a high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.


212 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. The loss turns into a penalized sum-of-squares minimization problem:

min_w ∣∣Xw − y∣∣²₂ + α∣∣w∣∣²₂

It is worth mentioning that the parameter α ≥ 0 controls the amount of robustness to collinearity that the trained model is going to have.
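The effect of α can be seen on a deliberately collinear toy dataset, where ordinary least squares produces wildly large, unstable coefficients while ridge keeps them small (an illustrative sketch, not data from this work):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)        # nearly collinear second feature
X = np.column_stack([x1, x2])
y = x1 + 0.01 * rng.normal(size=n)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
# ridge shrinks the coefficient vector, trading a little bias for much less variance
```

Increasing α shrinks the coefficients further; α = 0 recovers the OLS solution.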

213 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification to regression problems. During the training phase, the best possible solution is the one that is least penalized in total by a loss function. The support vectors will be the inputs that are either misclassified, classified within a small enough margin, or on the edge of the generated hyper-plane that splits the dataset for future predictions.

In particular, an SVR takes the training vectors xi ∈ Rp, i = 1, . . . , n, and a vector y ∈ Rn; ε-SVR solves the following primal problem:

min_{w,b,ζ,ζ*} (1/2) wTw + C ∑_{i=1}^{n} (ζi + ζi*)

subject to yi − wTφ(xi) − b ≤ ε + ζi, wTφ(xi) + b − yi ≤ ε + ζi*, and ζi, ζi* ≥ 0 for i = 1, . . . , n,

where e is the vector of all ones, C > 0 is the upper bound, and Q is an n by n positive semidefinite matrix with Qij ≡ K(xi, xj) = φ(xi)Tφ(xj) the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function φ.

The decision function is:

∑_{i=1}^{n} (αi − αi*) K(xi, x) + ρ

214 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than Linear Regression.

This technique elaborates a probabilistic model of the regression problem, in which the prior for the parameter w of the general Bayesian Regression solver is a spherical Gaussian:

p(w∣λ) = N(w∣0, λ⁻¹Ip)

The scikit-learn defaults are used to train the model: α1 = α2 = λ1 = λ2 = 10⁻⁶.

During the model fitting process, the parameters w, α and λ are estimated jointly.


215 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

min_w (1 / (2·n_samples)) ∣∣Xw − y∣∣²₂ + α∣∣w∣∣₁

This method solves the least-squares minimization with the penalty α∣∣w∣∣₁ added, where α is a constant and ∣∣w∣∣₁ is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm retrieves sparse models, which may be helpful for performing feature selection.
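This sparsity is easy to observe on synthetic data where only the first two of ten covariates actually drive the outcome (an illustrative sketch; the value of α is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of covariates kept by the model
```

The ℓ1 penalty drives the coefficients of the irrelevant covariates exactly to zero, which is what makes the fitted model usable as a feature selector.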

216 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with L1 regularization applied.

The objective function is:

(1 / (2·n_samples)) ∣∣y − Xw∣∣²₂ + α∣∣w∣∣₁

217 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w. It drops the assumption of the Gaussian being spherical, making it elliptical instead.

Mathematically:

p(w∣λ) = N(w∣0, A⁻¹)

with diag(A) = λ = {λ1, . . . , λp}

218 Passive Aggressive Regressor

Passive Aggressive algorithms are suitable for large-scale learning: they do not require a learning rate, but they do require a regularization parameter C.

They can be used with two different loss functions: PA-I (epsilon insensitive) or PA-II (squared epsilon insensitive).

219 Theil Sen Regressor

This estimator is especially suited to handle multivariate outliers, but its efficiency decreases dramatically on high-dimensionality problems; in high dimension, this method becomes similar to Linear Regression with Ordinary Least Squares.


2110 K-Neighbors Regressor

In this algorithm the target is predicted from the n nearest neighbors found during the training phase. It is important to notice that n is defined by the user and will affect, positively or negatively, the obtained prediction results.

2111 Logistic Regression

Logistic Regression is mostly used for classification problems; it models class probabilities with the logistic function and can also handle predictions over more than two classes.

The scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets and the results willbe discussed in the Experiments section
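Beyond direct outcome prediction, logistic regression is also the standard tool for estimating the propensity score p(t = 1∣x) used by the matching and re-weighting methods discussed in the Related Work section. The sketch below is illustrative only, with synthetic selection bias rather than data from this work:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# treatment assignment depends on the first covariate -> selection bias
t = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

ps = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]   # estimated p(t=1|x)
ipw = t / ps + (1 - t) / (1 - ps)    # inverse propensity weights for re-weighting
```

Units that received an unlikely treatment get large weights, which is how re-weighting simulates a balanced trial from biased observational data.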


Chapter 3

Methodology

This chapter details what was set out to be achieved. In addition, the methods that were used are explained, as well as any other information needed to help the reader follow the experiments covered later.

It also describes the dataset that was used, closing with a section about other possible datasets that could be applied and possible limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in full detail.

31 Dataset

Datasets for testing causal inference on observational data extracted from real lifescenarios are difficult to obtain

On the one hand, the whole point of the current project - and, to some extent, of the latest efforts in machine learning applied to causality - is to make accurate predictions on a set of units, patients, or inputs (in the machine learning vocabulary) that was collected without the chance of previously designing a carefully planned Randomized Controlled Trial. By its nature, already collected, observational data has not been randomized properly, nor does it come from the same probability distribution. Also, the number of units which received the treatment versus the number which did not could potentially differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions, since they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, tested against a control treatment - in this case, driving without alcohol consumption - is completely unethical to perform, for clear reasons.

To overcome these limitations when working on causal effects from observational data, synthetic, semi-synthetic or toy datasets are created by researchers in order to establish a good starting point and a benchmark framework to try, test or develop better algorithms that are able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions for causal inference. The most common one to obtain is that in which the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases, a policy risk function π is designed to apply or not apply the treatment t depending on a certain threshold θ. The main goal, when iterating over different values of this threshold for the trained dataset, is to make as few errors as possible when predicting the application of the active treatment or control.
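With (semi-)synthetic data, where both potential outcomes are known, the value of such a thresholded policy can be evaluated directly. The helper below is an illustrative sketch, not the exact policy-risk estimator used in the literature:

```python
import numpy as np

def policy_value(ite_hat, y1, y0, theta=0.0):
    """Mean outcome achieved when treating exactly the units with ite_hat > theta."""
    treat = ite_hat > theta
    return np.where(treat, y1, y0).mean()

# toy potential outcomes for three units (illustrative values only)
y0 = np.array([1.0, 2.0, 3.0])
y1 = np.array([2.0, 1.0, 5.0])
```

With perfect ITE estimates the policy picks the better arm for every unit; sweeping θ over a grid and keeping the best-performing value is the trial-and-error validation described above.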

32 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and integration at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

However, (Hill, 2011) presented a semi-synthetic (also called semi-simulated, in this work and in the field) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the above paragraph. In (Hill, 2011), some continuous and binary covariates from this real-life RCT were selected. Making use of these covariates, (Hill, 2011) created a simulated outcome and generated non-parametric simulated outcomes for the whole population of the trial; 25 covariates of the whole study were taken for this dataset creation. The author then introduced an artificial imbalance between the control and treated individuals by removing a subset of the treated population. Finally, the dataset comprises 747 subjects (units or inputs), of which 608 were not given the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 or t = 1 based on the generalization that an algorithm can perform.

Along with the covariates, the simulated causal information can be observed for each unit: the treatment effectively applied (t = 0 or t = 1), the factual outcome (y_F), the counterfactual outcome (y_CF), and the noiseless average outcomes mu0 and mu1.
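Assuming the replication files follow the array layout distributed with the CFR/CEVAE codebases — `.npz` bundles keyed `x`, `t`, `yf`, `ycf`, `mu0`, `mu1`, with the replication index as the last axis (an assumption, not verified here) — a loader can be sketched as:

```python
import numpy as np

def load_replication(path, i):
    """Load replication i from an IHDP .npz bundle (assumed key layout)."""
    d = np.load(path)
    return {
        "x":   d["x"][:, :, i],   # 25 covariates per unit
        "t":   d["t"][:, i],      # applied (factual) treatment
        "yf":  d["yf"][:, i],     # factual outcome
        "ycf": d["ycf"][:, i],    # counterfactual outcome (simulated)
        "mu0": d["mu0"][:, i],    # noiseless outcome under t = 0
        "mu1": d["mu1"][:, i],    # noiseless outcome under t = 1
    }
```

Only `x`, `t` and `yf` may be used for training; `ycf`, `mu0` and `mu1` exist solely for evaluation.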

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for hyperparameter selection and evaluation, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which constitute the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark for analyzing the prediction results of new machine learning techniques applied to causal inference on observational data.


3.3 Other articles' metrics

Other published articles evaluating ITE, ATE and PEHE errors are worth mentioning for the reader who wishes to look further into them. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, a different number of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b) the IHDP dataset (Hill, 2011) was run on 100 replications to perform hyperparameter tuning and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation since the response surface chosen differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). To implement the BART results, (Johansson, Shalit, and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill, 2011). However, the authors neither state the number of replications used to gather the metrics, nor specify whether log-linear response surface "A", "B" or any other method was used to simulate the semi-synthetic dataset. Consequently, the results obtained by them cannot be compared with this dissertation's results, and they are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset were performed to ultimately predict the factual (y_F) and counterfactual (y_CF) outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the ε_ITE, ε_ATE and ε_PEHE errors are calculated to evaluate the performance of the applied machine learning methods. In this work it is of particular interest to correctly predict the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should be aware by this point, neither the counterfactual outcome y_CF nor the noiseless outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these values, along with the factual outcome y_F, are only used to compute the ε_ITE, ε_ATE and ε_PEHE errors.
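The three error measures can be written down concretely. The exact conventions vary slightly across codebases, so the definitions below — absolute error of the average effect for ε_ATE, mean absolute unit-level error for ε_ITE, and the root of the mean squared unit-level error for √ε_PEHE — are one common reading, not necessarily the precise formulas of (Louizos et al., 2017):

```python
import numpy as np

def causal_errors(mu0, mu1, pred0, pred1):
    """Evaluation errors from the noiseless simulated outcomes (mu0, mu1)
    and the model's predictions under t = 0 and t = 1 (pred0, pred1)."""
    ite_true = mu1 - mu0
    ite_pred = pred1 - pred0
    eps_ite = np.mean(np.abs(ite_true - ite_pred))            # unit-level MAE
    eps_ate = abs(ite_true.mean() - ite_pred.mean())          # average-effect error
    root_pehe = np.sqrt(np.mean((ite_true - ite_pred) ** 2))  # sqrt(eps_PEHE)
    return eps_ite, eps_ate, root_pehe
```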

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with their state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson, and Sontag, 2017), the same methodology later followed by (Louizos et al., 2017) to perform, compare and show their results.

Within-sample: refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples in which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and its factual outcome are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data, and they were mentioned in the previous chapters.


Out-of-sample: these predictions are made on completely unseen units, outside the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions different even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0, and then the predictions are repeated setting all the values of the treatment to t = 1. The subtraction between these two predictions for each input is known as the ITE, and it will ultimately define whether the patient would benefit or not from applying the treatment. Mathematically it is represented by E[Y1 − Y0 | x].

The machine learning algorithms implemented in Python by (User guide: contents — scikit-learn 0.19.2 documentation) were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. The hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
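The train-then-predict-twice procedure just described can be sketched as follows. The helper name `fit_and_estimate_ite` is hypothetical, and the SVR default mirrors the hyperparameters reported later in this chapter:

```python
import numpy as np
from sklearn.svm import SVR

def fit_and_estimate_ite(x, t, yf, regressor=None):
    """Train one regressor on [x, t] -> factual outcome, then predict every
    unit under t = 0 and t = 1; the difference estimates E[Y1 - Y0 | x]."""
    reg = regressor if regressor is not None else SVR(kernel="rbf", C=1e3, gamma=0.01)
    reg.fit(np.column_stack([x, t]), yf)
    y0 = reg.predict(np.column_stack([x, np.zeros(len(x))]))  # force t = 0
    y1 = reg.predict(np.column_stack([x, np.ones(len(x))]))   # force t = 1
    return y1 - y0
```

Any scikit-learn regressor from the tables below can be passed in place of the SVR.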

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the same algorithms on the 10 replications of the IHDP dataset.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  ε_ITE          ε_ATE          √ε_PEHE

Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is remarkable that the split between training and testing was performed randomly, over the training dataset only. The intention was to test whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  ε_ITE          ε_ATE          √ε_PEHE

Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                  ε_ITE          ε_ATE          √ε_PEHE

Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                  ε_ITE          ε_ATE          √ε_PEHE

Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the datasets already split into training and test sets obtained from (Johansson, 2017 (accessed July 19, 2018)), which are the exact same datasets used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, any hyperparameter tuning was performed on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

                                  ε_ITE          ε_ATE          √ε_PEHE

Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

                                  ε_ITE          ε_ATE          √ε_PEHE

Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response surface setting "B" generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four different tables are presented for the 1000 replications. One pair (Tables 4.7 and 4.8) was obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results of the methods presented in the following section.
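A minimal sketch of the scaled setup, wrapping the MinMaxScaler and the regressor in a Pipeline so that the scaling statistics are fit on the training covariates only and reused at prediction time (the concrete regressor and hyperparameters here are just the ones reported later in this chapter):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# MinMaxScaler maps each feature to [0, 1] based on training data only;
# the pipeline applies the same transform before every prediction.
model = make_pipeline(MinMaxScaler(), SVR(kernel="rbf", C=1e3, gamma=0.01))
```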



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                  ε_ITE          ε_ATE          √ε_PEHE

Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                  ε_ITE          ε_ATE          √ε_PEHE

Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                  ε_ITE          ε_ATE          √ε_PEHE

Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                  ε_ITE          ε_ATE          √ε_PEHE

Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial (multi-class) predictor was applied. Its performance is well below the regressors', the main reason being that the continuous target values must be encoded into classes to assign them probabilities, and those classes are not the values that actually need to be predicted; moreover, precision is lost when decoding the predictions. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

                                       ε_ITE          ε_ATE          √ε_PEHE

LogisticRegression - L2 (newton-cg)    7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)        7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                       ε_ITE          ε_ATE          √ε_PEHE

LogisticRegression - L2 (newton-cg)    5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)        5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
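A sketch of the discretization described above; the number of bins and the decoding via bin centres are my own illustrative choices, not the thesis code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_binned_classifier(x, y, n_bins=20, solver="lbfgs"):
    """Bin continuous outcomes into classes, fit a multinomial logistic
    regression, and decode predictions back to bin centres -- the
    decoding step is where precision is inevitably lost."""
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    labels = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)
    clf = LogisticRegression(solver=solver, max_iter=1000).fit(x, labels)
    centres = (edges[:-1] + edges[1:]) / 2
    return lambda xq: centres[clf.predict(xq)]  # decoded (approximate) outcomes
```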


From all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyperparameter tuning were performed. The errors observed were even smaller, so the final hyperparameters selected for this dataset were a Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyperparameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyperparameter tuning runs.

TABLE 4.13: IHDP 100 replications, SVR hyperparameter tuning - Within-sample

                          ε_ITE          ε_ATE          √ε_PEHE

SVR-rbf-1e3-g0.1          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyperparameter tuning - Out-of-sample

                          ε_ITE          ε_ATE          √ε_PEHE

SVR-rbf-1e3-g0.1          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39

Finally, the final results obtained by the experiments run in this thesis are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √ε_PEHE      ε_ATE

OLS/LR-1    5.8 ± .3     .73 ± .04
OLS/LR-2    2.4 ± .1     .14 ± .01
BLR         5.8 ± .3     .72 ± .04
k-NN        2.1 ± .1     .14 ± .01
TMLE        5.0 ± .2     .30 ± .01
BART        2.1 ± .1     .23 ± .01
RANDFOR     4.2 ± .2     .73 ± .05
CAUSFOR     3.8 ± .2     .18 ± .01
BNN         2.2 ± .1     .37 ± .03
TARNET      .88 ± .0     .26 ± .01
CFR MMD     .73 ± .0     .30 ± .01
CFR WASS    .71 ± .0     .25 ± .01

Within-sample, IHDP, 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √ε_PEHE      ε_ATE

OLS/LR-1    5.8 ± .3     .94 ± .06
OLS/LR-2    2.5 ± .1     .31 ± .02
BLR         5.8 ± .3     .93 ± .05
k-NN        4.1 ± .2     .79 ± .05
BART        2.3 ± .1     .34 ± .02
RANDFOR     6.6 ± .3     .96 ± .06
CAUSFOR     3.8 ± .2     .40 ± .03
BNN         2.1 ± .1     .42 ± .03
TARNET      .95 ± .0     .28 ± .01
CFR MMD     .78 ± .0     .31 ± .01
CFR WASS    .76 ± .0     .27 ± .01

Out-of-sample, IHDP, 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications on powerful feature selection methods were presented in the Related Work section, an experiment was also performed in the developed code applying Recursive Feature Elimination (RFE) with the scikit-learn library.

In addition, under the Strong Ignorability assumption made for the studied dataset, it would not strictly be appropriate to perform such an experiment; but given the nature of the machine learning regressors and their sensitivity to highly correlated input features, it might relieve some of the errors they make.
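A minimal sketch of such an RFE step with scikit-learn; the estimator choice and the number of retained features are illustrative assumptions, not the thesis's configuration:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

def select_covariates(x, y, k=10):
    """Rank covariates by recursively dropping the weakest coefficient
    and return a boolean mask of the k columns to keep. The treatment
    indicator must be excluded from elimination and always retained."""
    rfe = RFE(LinearRegression(), n_features_to_select=k).fit(x, y)
    return rfe.support_  # boolean mask over the columns of x
```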

The results were significantly worse across all algorithms for the causal inference metrics detailed in this work; thus they are not shown, but they can be reviewed by the reader in the code implementation for further analysis.

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications; the code uploaded with this work contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine. On CPU, the estimated completion time was about four and a half days with an Intel dual-core i7.

Domain Adaptation algorithms are a promising direction for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

                        ε_ITE          ε_ATE          √ε_PEHE

DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and out-of-sample, IHDP, 10 replications

4.3 Discussion

As can be clearly noticed, the results of the machine learning regressors applied in this dissertation are very close to those obtained in the works published by the cited authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the unbalanced treatment assignment) nor any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.


It seems that an excessive amount of effort from the authors leads to complicated methods without gaining much in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can help to overcome this problem much better than out-of-the-box machine learning algorithms. No metrics, however, are reported on heavily unbalanced treatment assignment datasets.

Finally, the 10-replication Domain Adaptation Neural Network training and testing errors showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less (or no) added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is one of the usual tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes to later present the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, and it gained relevance when the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal problems on observational data, with more than two possible treatments, that are substantially more amenable to Reinforcement Learning algorithms than to any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include several different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets, and to compare the results with other published papers and algorithms of the same or greater complexity.

Second, extending the functionality developed for a binary treatment and a continuous outcome to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold determining whether a treatment should be applied or not, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing with this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due to both the architectural design of the algorithm and the precision, outperforming the state of the art, that it showed in the experiment run.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied over time is framed within time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers working on machine learning applied to treatments administered over time.


Bibliography

Alaa Ahmed M Michael Weisz and Mihaela Van Der Schaar (2017) Deep Counter-factual Networks with Propensity-Dropout Tech rep arXiv arXiv170605966v1URL httpsarxivorgpdf170605966pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou, Léon et al (2013) Counterfactual Reasoning and Learning Systems The Example of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf


Chernozhukov Victor et al (2016) ldquoDouble/Debiased Machine Learning for Treatment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumé, Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815 URL httparxivorgabs09071815

Dorie Vincent (2016) NPCI Non-parametrics for Causal Inference URL https githubcomvdorienpci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela Van Der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803.

Johansson, Fredrik D. (2017 (accessed July 19, 2018)). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www.

Johansson, Fredrik D., Uri Shalit and David Sontag. "Learning Representations for Counterfactual Inference". In: (). URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.nature.com/articles/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373. URL: http://www.ncbi.nlm.nih.gov/pubmed/23025434.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1.

Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880. URL: https://doi.org/10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". In: URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Complete Draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.


Chapter 2

Background

2.0.1 Rubin-Neyman Causal Model

The Rubin causal model (RCM) (Rubin, 2005), also known as the Rubin-Neyman potential outcomes framework, is a statistical framework developed by Donald Rubin to model causal effects in observational data. Rubin built it on top of the potential outcomes notation that Jerzy Neyman introduced in his 1923 master's thesis, extending it from randomized experiments to non-randomized controlled trials (observational data).

The Rubin-Neyman potential outcomes framework considers units

xᵢ ∈ X

with an effectively applied treatment

tᵢ ∈ {0, 1}

The two possible potential outcomes are defined by

Y0(xᵢ), Y1(xᵢ) ∈ Y

For one of them (the one which actually happened) we can observe the factual outcome:

yᵢF = tᵢY1(xᵢ) + (1 − tᵢ)Y0(xᵢ)

And let (x₁, t₁, y₁F), …, (xₙ, tₙ, yₙF) be a sample from the factual distribution.

Consequently, let (x₁, 1 − t₁, y₁CF), …, (xₙ, 1 − tₙ, yₙCF) be the counterfactual sample.

Notice that all the factual outcomes yF are known, whereas the counterfactual outcomes yCF are never observed for any unit (except at testing time, and only when the dataset is semi-synthetic or synthetic).

The expressions yF and yft will be used interchangeably to refer to factual observed outcomes, while yCF and ycft refer to counterfactual outcomes.
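As a small illustration of this bookkeeping, the factual and counterfactual outcomes can be computed from synthetic potential outcomes; this is a sketch with made-up arrays (`y0`, `y1`, and `t` are this sketch's own names, not part of the original study):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Synthetic potential outcomes Y0(x_i), Y1(x_i) and assigned treatments t_i
y0 = rng.normal(size=n)
y1 = y0 + 1.0                       # a constant treatment effect of +1, for illustration
t = rng.integers(0, 2, size=n)

# Factual outcome: y_i^F = t_i * Y1(x_i) + (1 - t_i) * Y0(x_i)
y_f = t * y1 + (1 - t) * y0
# Counterfactual outcome: the one we can never observe in practice
y_cf = (1 - t) * y1 + t * y0
print(y_f, y_cf)
```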

2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x assigned either treatment t = 1 or t = 0, it is impossible to observe the counterfactual outcome E[Y0∣x, t = 1] or E[Y1∣x, t = 0] (what the outcome would have been had the other treatment been given to unit x).

However, it is always possible to observe the outcome of the effectively applied treatment t, which is represented as E[Y0∣x, t = 0] or E[Y1∣x, t = 1], or in shorter terms Y0 or Y1.

In this dissertation the focus is on the case where the causal graph is simple and known to be of the form (Y1, Y0) ← x → t, with no hidden confounders.

This problem can be extended, as can most of the problems and applications discussed under the Rubin-Neyman potential outcomes framework in this dissertation, to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome Ycf is even worse in multi-treatment experiments, since the number of missing values that matter for better Individual Treatment Effect estimates grows with the number of possible treatments other than the one applied.

2.0.3 Metrics for Causality

Three well-known metrics in the causality field are reported for each machine learning technique applied.

The losses that will be reported are

• εITE: Error of the Individual Treatment Effect, also known as the Conditional Average Treatment Effect (CATE); it measures how well or how poorly the treatment performs on one particular patient.

ITE(x) ∶= E[y∣X = x, t = 1] − E[y∣X = x, t = 0] = E[Yx1 − Yx0]

• εATE: Error of the Average Treatment Effect; as its name indicates, it represents the effect that the applied treatment (either t = 0 or t = 1) effectively had on the whole population. Note that, being an average, it may not be the best guide for treating a new patient, whose unique characteristics as a unit might lead to wrong results or no results at all.

ATE ∶= E[ITE(x)] = E[δ] = E[Y1 − Y0], ∀x ∈ X

• εPEHE: Precision in Estimation of Heterogeneous Effect, which measures the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong, or less accurate, for the other:

PEHE ∶= (1/N) ∑_{i=1}^{N} ((yᵢ₁ − yᵢ₀) − (ŷᵢ₁ − ŷᵢ₀))²
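The metrics above can be sketched in NumPy for the case where the true potential outcomes are available (as in semi-synthetic benchmarks); the function names are this sketch's own, and εPEHE is written exactly as the squared definition above:

```python
import numpy as np

def eps_ate(y1, y0, y1_hat, y0_hat):
    # Absolute error between the true and the estimated Average Treatment Effect
    return np.abs(np.mean(y1 - y0) - np.mean(y1_hat - y0_hat))

def eps_pehe(y1, y0, y1_hat, y0_hat):
    # Mean squared error between true and estimated Individual Treatment Effects
    return np.mean(((y1 - y0) - (y1_hat - y0_hat)) ** 2)

y1 = np.array([2.0, 3.0, 4.0])
y0 = np.array([1.0, 1.0, 1.0])

# A perfect predictor scores zero on both metrics;
# a predictor that always estimates zero effect does not.
print(eps_ate(y1, y0, y1, y0), eps_pehe(y1, y0, y1, y0))   # 0.0 0.0
print(eps_ate(y1, y0, y0, y0), eps_pehe(y1, y0, y0, y0))   # 2.0 4.666...
```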

2.0.4 Assumptions

To work on the results, three important assumptions under the Rubin-Neyman causal framework shall be made:

• Consistency: for each unit just one of the two potential outcomes can be observed. Hence, if t = 0, then y = Y0 is the observed (factual) outcome yF; if the applied treatment was t = 1, then y = Y1 is the available observed (factual) outcome yF.

• Strong Ignorability: also known as no unmeasured confounders, this assumption can be stated as (Y1, Y0) ⫫ t ∣ x and 0 < p(t = 1∣x) < 1, ∀x. It is important to notice that, to be able to state this assumption, a domain-knowledge expert has to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: this assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1∣x) < 1
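The common-support condition can be probed empirically by fitting a propensity score model for p(t = 1∣x) and checking that the estimated probabilities stay away from 0 and 1. This is an illustrative sketch on synthetic data, not a procedure from the original study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
# Treatment assignment depends on the covariates (confounded, but with overlap)
p = 1 / (1 + np.exp(-0.8 * X[:, 0]))
t = rng.binomial(1, p)

prop = LogisticRegression().fit(X, t)
e_hat = prop.predict_proba(X)[:, 1]   # estimated propensity scores

# Empirical check of 0 < p(t=1|x) < 1: no estimated score at the extremes
print(e_hat.min(), e_hat.max())
```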

2.0.5 Definitions

In causal inference from observational data, several terms are used interchangeably and might confuse the reader.

The terms in this subsection should be clear before going further into this dissertation.

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input. Notation: xᵢ, xᵢ ∈ X.

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML). Notation: x, x ∈ X.

• treatment: the possible actions that can be applied to a unit. Usually binary, but can be multi-valued under the Rubin-Neyman potential outcomes framework. Synonym: action. Notation: t, t ∈ {0, 1} or t ∈ {0, …, N}.

• outcome: the measured result of applying a treatment t to a unit x. Synonyms: observed outcome, result, factual. Notation: Y, yf = yF.

• counterfactual: what the result would have been if the treatment opposite to the one effectively applied had been given to a unit. Synonym: unobserved outcome. Notation: ycft, ycf, YCF.

2.0.6 Related Work

Potential outcomes are the framework to mathematically describe causality and coun-terfactuals (Rubin 1978)

Causality from observational data is applicable to a wide range of industries, e.g. advertisement placement, health care systems, finance, or even education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar, 2016; Bottou et al., 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship, 2014; Chernozhukov et al., 2016), whereas in machine learning the attention was caught no less than a decade ago (Lang, 1995; Bottou et al., 2013; Swaminathan and Joachims, 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al., 2010; Triantafillou and Tsamardinos, 2015; Mooij et al., 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods model the relationships among features, actions and rewards by fitting one or more parameters, trying to specifically model the relations within context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice, 1976; Gelman and Hill, 2007), random forests (Wager and Athey, 2015) and regression trees (Chipman, George and McCulloch, 2010) have been used in the past to complete the task. For example, (Wager and Athey, 2017) estimate ITEs with causal forests, but their asymptotic estimates in datasets with a large number of relevant features have limitations that need to be addressed in future work.

In non-parametric approaches, counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan, 2016; Austin, 2011; Rosenbaum and Rubin, 1983; Rosenbaum, 2002). Doubly robust causality is performed by merging parametric and non-parametric methods (Dudik, Langford and Li, 2011; Jiang and Li, 2015).

Doubly robust methods are known for merging the characteristics of both families. A common example of this approach is propensity-score-weighted regression (Bang and Robins, 2005; Dudik, Langford and Li, 2011). When the treatment assignment probability is known, this approach models the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational-data settings its efficiency drops dramatically (Kang and Schafer, 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions, as well as the application of other techniques to causality, with special focus on datasets with unbalanced treatment assignment. This refers to the sub-area of causality known as causal inference from observational data. Observational data is data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting

advances to the scientific community (Shalit, Johansson and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag, 2016a; Shalit, Johansson and Sontag, 2017; Alaa, Weisz and Van Der Schaar, 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, minimizes the risk of the predicted action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few, (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag, 2016) and (Shalit, Johansson and Sontag, 2017). (Johansson, Shalit and Sontag, 2016a) and (Shalit, Johansson and Sontag, 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error loss between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag, 2017) the authors built on (Johansson, Shalit and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work was performed using Gaussian processes (Alaa, Weisz and Van Der Schaar, 2017) and decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with many features by performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning features that are relevant for predicting some actions while not taking them into account for others. The relevant feature selection was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), likewise in (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and did not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims, 2015a) came up with a Counterfactual Risk Minimization (CRM) method, in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald and Pereira, 2006). Additional techniques in policy optimization were developed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work in the DA techniques field was done by (Zhang et al., 2013; Daumé, 2009).

To conclude, in cause-and-effect analysis, time series data is widely adopted for decision-making support. The main challenge in the continuous-time setting is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions over time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator based on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time-series causality task. Later, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used in off-policy learning to estimate the best expected reward of a policy that is set beforehand (Dudik, Langford and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas and Brunskill, 2017).

2.1 Machine Learning

This section describes the machine learning techniques, implemented with the scikit-learn open-source framework, that were applied in the experiments.

The vast majority of the tested methods belong to the family of Generalized Linear Models, in which the target (or label) value is represented as a linear combination of the covariates (inputs):

ŷ(w, x) = w0 + w1x1 + … + wpxp  (2.1)

where the vector w = (w1, …, wp) represents the coefficients and w0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between theobserved dataset and the predictions made on it

Mathematically it solves the problem of

min_w ∣∣Xw − y∣∣₂²

The main limitation of this method is that, if the features (covariates) have an approximate linear dependence, the model produces high variance and is therefore more sensitive to random errors in the predictions. This limitation especially affects data collected without a previously shaped experimental design.
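A minimal scikit-learn sketch of fitting ordinary least squares on synthetic data (the variable names and generating coefficients are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Ground truth: y = 3*x1 - 2*x2 plus a small amount of noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

ols = LinearRegression().fit(X, y)
print(ols.coef_, ols.intercept_)   # coefficients close to [3, -2], intercept near 0
```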

2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the Linear Regression method mentioned above by penalizing the size of the coefficients. The loss becomes a penalized residual sum of squares:

min_w ∣∣Xw − y∣∣₂² + α∣∣w∣∣₂²

It is worth mentioning that the parameter α ≥ 0 controls the amount of shrinkage, and therefore the robustness to collinearity, of the trained model.
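The shrinkage effect of α can be seen on a nearly collinear design; this is a hedged sketch on synthetic data (the α values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
# Nearly collinear design: the second column almost duplicates the first
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=200)])
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=200)

# Larger alpha shrinks the coefficient vector toward zero
norms = {a: np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in (0.01, 1.0, 100.0)}
print(norms)
```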

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification, adapted to solve regression problems. During the training phase, the best possible solution is the one least penalized in total by the loss function. The support vectors are the inputs that are either misclassified, classified within the margin, or lie on the edge of the generated hyperplane that splits the dataset for future predictions.

In particular, an SVR takes training vectors xᵢ ∈ Rp, i = 1, …, n, and a vector y ∈ Rn. ε-SVR solves the following primal problem:

min_{w,b,ζ,ζ*}  (1/2) wᵀw + C ∑_{i=1}^{n} (ζᵢ + ζᵢ*)

subject to yᵢ − wᵀφ(xᵢ) − b ≤ ε + ζᵢ, wᵀφ(xᵢ) + b − yᵢ ≤ ε + ζᵢ*, and ζᵢ, ζᵢ* ≥ 0.

In the corresponding dual problem, e is the vector of all ones, C > 0 is the upper bound and Q is an n-by-n positive semidefinite matrix, Qᵢⱼ ≡ K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ), where K is the kernel. Here training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is:

∑_{i=1}^{n} (αᵢ − αᵢ*) K(xᵢ, x) + ρ
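A short, hedged example of fitting an ε-SVR with scikit-learn on synthetic data; the hyperparameters here are illustrative, not the ones used in the experiments:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# RBF kernel maps inputs into a higher-dimensional space implicitly
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(svr.score(X, y))        # R^2 on the training data
print(len(svr.support_))      # number of support vectors kept
```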

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than ordinary Linear Regression.

This technique elaborates a probabilistic model of the regression problem, in which the prior for the parameter w of the general Bayesian Regression solver is a spherical Gaussian:

p(w∣λ) = N (w∣0, λ⁻¹Ip)

The scikit-learn defaults are used to train the model: α1 = α2 = λ1 = λ2 = 10⁻⁶.

During the fitting of the model, the parameters w, α and λ are estimated jointly.

2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

min_w (1/(2n_samples)) ∣∣Xw − y∣∣₂² + α∣∣w∣∣₁

This method minimizes the least-squares loss with the penalty α∣∣w∣∣₁ added, where α is a constant and ∣∣w∣∣₁ is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm retrieves sparse models which may behelpful to perform feature selection
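The sparsity property can be demonstrated on synthetic data where only two covariates are relevant; this is a hedged sketch (the value of α is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two covariates actually drive the outcome
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # the irrelevant coefficients are driven exactly to zero
```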

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with L1 regularization applied.

The objective function is determined by

(1/(2n_samples)) ∣∣y − Xw∣∣₂² + α∣∣w∣∣₁

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w.

It drops the assumption of the Gaussian being spherical, allowing it to be elliptical.

Mathematically:

p(w∣λ) = N (w∣0, A⁻¹)

with diag(A) = λ = {λ1, …, λp}.

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, this method does not require a learning rate, but it does require a regularization parameter C.

It can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (squared epsilon-insensitive).

2.1.9 Theil Sen Regressor

This method is especially suited to handle multivariate outliers, but its efficiency decreases dramatically in high-dimensional problems, where it becomes similar to Linear Regression with Ordinary Least Squares.

2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the n nearest neighbors found in the training set. It is important to notice that n is defined by the user, and its choice will affect, positively or negatively, the quality of the predictions.
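A small sketch of how the choice of n affects K-Neighbors predictions on synthetic data (all values here are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(size=300)

# Predict the target at x = 5 (true value 25) for several neighborhood sizes
preds = {}
for n in (1, 5, 50):
    knn = KNeighborsRegressor(n_neighbors=n).fit(X, y)
    preds[n] = knn.predict([[5.0]])[0]
print(preds)
```

Very small n tends to overfit the noise, while very large n over-smooths the curvature; the user-chosen value trades the two off.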

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems; it models class probabilities with the logistic function and can be used for predictions over more than one class.

This scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets and the results willbe discussed in the Experiments section
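As a hedged sketch, a few of the solver/penalty combinations that scikit-learn supports can be tried as follows (the combinations and the synthetic data are illustrative, not the exact configurations evaluated in the Experiments chapter):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# A few solver/penalty combinations supported by scikit-learn
configs = [("lbfgs", "l2"), ("liblinear", "l1"), ("saga", "l1")]
for solver, penalty in configs:
    clf = LogisticRegression(solver=solver, penalty=penalty, max_iter=5000).fit(X, y)
    print(solver, penalty, clf.score(X, y))
```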

Chapter 3

Methodology

This chapter details what this work set out to achieve. In addition, the methods that were used are explained, together with any other information needed to help the reader understand the flow of the experiments covered later.

It also describes how the dataset was used, closing with a section on other possible datasets and on the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of the recent efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients, or inputs (in machine learning vocabulary) that was collected without the chance of previously designing a carefully planned Randomized Controlled Trial. By its nature, such already-collected (observational) data has not been randomized properly, nor does it come from a single probability distribution. Also, the number of units which received the treatment and the number which did not could differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, would be completely unethical for clear reasons.

To overcome these limitations when working on causal effects in observational data, synthetic, semi-synthetic, or toy datasets are created by researchers in order to establish a good starting point and a benchmark framework to try, test, or develop better algorithms able to make more accurate predictions, surpassing the state-of-the-art results.

Lastly, it is important to notice that there are two different kinds of predictions in causal inference. The most common case is the one in which the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In such cases, a policy risk function π is designed to apply the treatment t, or not, depending on a certain threshold θ. The main goal when iterating over different values of this threshold on the trained dataset is to minimize the errors made when predicting the application of the active treatment or the control.
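The thresholded policy described above can be sketched as follows; the names `policy`, `ite_hat`, and `theta` are this sketch's own, and the predicted effects are synthetic values, not results from the thesis:

```python
import numpy as np

def policy(ite_hat, theta):
    """pi(x) = 1 (treat) when the predicted effect exceeds the threshold theta."""
    return (ite_hat > theta).astype(int)

# Synthetic predicted treatment effects, for illustration only
rng = np.random.default_rng(0)
ite_hat = rng.normal(loc=0.2, scale=1.0, size=1000)

for theta in (0.0, 0.5, 1.0):
    treat_rate = policy(ite_hat, theta).mean()
    print(theta, treat_rate)  # higher thresholds treat fewer units
```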

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held across multiple sites in the United States, applying control and treatment to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and integration at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

(Hill, 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the paragraph above. In (Hill, 2011) some continuous and binary covariates from this real-life RCT were selected; making use of these covariates, the author created non-parametric simulated outcomes for the whole population of the trial. In total, 25 covariates of the original study were kept for the dataset creation. The author then introduced an artificial imbalance between the control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units, or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 and t = 1 based on the generalization that an algorithm can perform.

Along with the covariates, the simulated causal information can be observed for each unit: the treatment effectively applied (t = 0 or t = 1), the observed factual outcome (yF), the counterfactual outcome (yCF), and the noisy average potential outcomes mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

Other published articles evaluating ITE, ATE and PEHE errors are worth mentioning, should the reader wish to look further into them. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b) the IHDP dataset (Hill, 2011) was run on 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation since the response surface chosen differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). For their BART results, (Johansson, Shalit, and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill, 2011). However, the authors do not state the number of replications used to gather the metrics, nor do they specify whether log-linear setting "A", "B" or any other method was used to simulate the semi-synthetic dataset. Consequently, the results they obtained cannot be compared with this dissertation's results, and they are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset were performed to ultimately predict the factual (yF) and counterfactual (yCF) outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should be aware by this point, neither the counterfactual outcome yCF nor the average potential outcomes with noise, mu0 and mu1, can be used at all to train the regressor models. Instead, these three values, along with the factual outcome yF, are only used to compute the εITE, εATE and εPEHE errors.
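For reference, two of these errors can be computed from the predicted potential outcomes and the simulated ground truth roughly as follows. This is a sketch under the stated assumptions (the function name is hypothetical); the actual evaluation code used in this work is the one released by (Louizos et al., 2017), and εITE is computed analogously against the noisy factual/counterfactual outcomes:

```python
import numpy as np

def evaluation_errors(y0_pred, y1_pred, mu0, mu1):
    """eps_ATE and sqrt(eps_PEHE) against the simulated means mu0, mu1."""
    effect_pred = y1_pred - y0_pred   # predicted per-unit effect
    effect_true = mu1 - mu0           # simulated ground-truth effect
    eps_ate = np.abs(effect_pred.mean() - effect_true.mean())
    pehe = np.sqrt(np.mean((effect_pred - effect_true) ** 2))
    return eps_ate, pehe
```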

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The 10, 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson, and Sontag, 2017) in their publication, the same technique later followed by (Louizos et al., 2017) to perform, compare and show their results.

Within-sample: this refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already-trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples receiving treatments t = 0 and t = 1), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population who received treatment t = 1 and the population who received t = 0 might come from completely different probability distributions. All of these are common problems of observational data, and they were mentioned earlier in this work.


Out-of-sample: these predictions are made on completely unseen units, outside of the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts an outcome for each one of the inputs (units) using the treatment value t = 0, and subsequently predictions are made setting all the values of the treatment to t = 1. The subtraction between these two predictions for each input is known as the ITE, and it ultimately defines whether the patient would benefit or not from applying the treatment. Mathematically it is represented by E[Y1 − Y0 | x].

The machine learning algorithms implemented in Python by (User guide: contents, scikit-learn 0.19.2 documentation) were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. Hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
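The procedure just described can be sketched with any scikit-learn regressor by appending the treatment to the covariates at training time and then scoring each unit under both treatment values. The toy data, the helper name and the choice of LinearRegression below are illustrative assumptions only, not the thesis pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def predict_ite(model, X):
    """Score each unit under t=1 and t=0 and take the difference."""
    X_t1 = np.column_stack([X, np.ones(len(X))])
    X_t0 = np.column_stack([X, np.zeros(len(X))])
    return model.predict(X_t1) - model.predict(X_t0)

# Toy factual data with a constant true treatment effect of 2.0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
t = rng.integers(0, 2, size=200)
y_f = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(np.column_stack([X, t]), y_f)
ite = predict_ite(model, X)   # close to 2.0 for every unit here
```

Any of the regressors listed in the tables below can be swapped in for LinearRegression without changing the surrounding logic.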

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is worth remarking that the split between training and testing was performed randomly, over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained in the following experiments with the datasets already split into train and test, downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which accounts for the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, hyperparameter tuning, when performed, was done on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with log-linear response setting A generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four tables can be observed for the 1000 replications, in pairs: two of them (Tables 4.7, 4.8) were obtained without normalization of the input features (covariates), while the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling in the final results of the methods presented in the following section.
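The scaled variant amounts to prepending a MinMaxScaler to each regressor. A minimal sketch of that pipeline, on illustrative synthetic data rather than the IHDP files:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Scale the inputs to [0, 1] before fitting, as in the scaled tables
model = make_pipeline(MinMaxScaler(), SVR(kernel="rbf", C=1e3, gamma=0.01))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))   # covariates plus a treatment column
y = X[:, 0] + rng.normal(scale=0.1, size=100)
model.fit(X, y)
preds = model.predict(X)
```

Using a pipeline ensures the scaler is fit on the training split only and reapplied, unchanged, to held-out units.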



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is far below that of the regressors, the main reason being that when encoding the target values to assign them a probability, these are not the same values that need to be predicted; precision is also lost when decoding the predictions. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

                                       εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)    7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)        7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                       εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)    5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)        5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


Across all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were performed for it. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyperparameter selection, with the final results included.
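The kernel, C and gamma values explored can be swept with an ordinary scikit-learn grid search. The sketch below runs on synthetic data for illustration, whereas the thesis selected hyper-parameters on the 100 IHDP replications; the grid mirrors the values appearing in the tables:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = [
    {"kernel": ["rbf"], "C": [1e3, 1e10], "gamma": [0.1, 0.05, 0.01, 0.001]},
    {"kernel": ["poly"], "C": [1e3], "degree": [1, 2, 4]},
]
search = GridSearchCV(SVR(), param_grid, scoring="neg_mean_squared_error", cv=3)

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X[:, 0] + 0.1 * rng.normal(size=120)
search.fit(X, y)
best = search.best_params_
```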

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

                          εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

                          εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39

Finally, the results obtained by this thesis's experiments are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

             √εPEHE       εATE
OLS/LR-1     5.8 ± .3     .73 ± .04
OLS/LR-2     2.4 ± .1     .14 ± .01
BLR          5.8 ± .3     .72 ± .04
k-NN         2.1 ± .1     .14 ± .01
TMLE         5.0 ± .2     .30 ± .01
BART         2.1 ± .1     .23 ± .01
RAND.FOR.    4.2 ± .2     .73 ± .05
CAUS.FOR.    3.8 ± .2     .18 ± .01
BNN          2.2 ± .1     .37 ± .03
TARNET       .88 ± .0     .26 ± .01
CFR MMD      .73 ± .0     .30 ± .01
CFR WASS     .71 ± .0     .25 ± .01

Within-sample, IHDP 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

             √εPEHE       εATE
OLS/LR-1     5.8 ± .3     .94 ± .06
OLS/LR-2     2.5 ± .1     .31 ± .02
BLR          5.8 ± .3     .93 ± .05
k-NN         4.1 ± .2     .79 ± .05
BART         2.3 ± .1     .34 ± .02
RAND.FOR.    6.6 ± .3     .96 ± .06
CAUS.FOR.    3.8 ± .2     .40 ± .03
BNN          2.1 ± .1     .42 ± .03
TARNET       .95 ± .0     .28 ± .01
CFR MMD      .78 ± .0     .31 ± .01
CFR WASS     .76 ± .0     .27 ± .01

Out-of-sample, IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications on powerful feature selection methods were presented in the Related Work section, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) with the scikit-learn library.

In addition, assuming Strong Ignorability on the studied dataset, it would not strictly be appropriate to perform such an experiment; but given the sensitivity of the machine learning regressors to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work, thus they are not shown; they can be reviewed by the reader in the code implementation for further analysis.
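In essence, the experiment wraps an estimator in scikit-learn's RFE. A small self-contained sketch on synthetic data (a linear-kernel SVR is used here because RFE requires a model exposing coef_ or feature_importances_; this is illustrative, not the exact thesis configuration):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features carry signal
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# Iteratively drop the feature with the smallest |coef_| until 5 remain
selector = RFE(SVR(kernel="linear"), n_features_to_select=5).fit(X, y)
kept = selector.support_   # boolean mask over the 10 input features
```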

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications; in this work, the uploaded code contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated finishing time was about four and a half days with an Intel Dual Core i7.

Domain Adaptation algorithms are a promising avenue for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

                        εITE           εATE           √εPEHE
DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and out-of-sample, IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the machine learning regressors applied in this dissertation come very close to the results published by the cited authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treatment assignment, and no other custom loss function were applied to obtain the results shown in Table 4.9 and Table 4.10.


It seems that an excessive amount of effort from the authors leads to complicated methods that do not gain much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No metrics are reported on heavily unbalanced treatment assignment datasets, though.

Finally, the training and testing errors of the Domain Adaptation Neural Network on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state-of-the-art.

It has to be taken into account that no custom metric functions or special preprocessing steps (except scaling the features, which is part of the must-do tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not applied before to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes to later present the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, and it gained importance once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years are potentially more suitable than both plain regressors and custom or generalized metric and error functions; Domain Adaptation Neural Networks and other methods from the Deep Learning literature are examples. Moreover, continuous-space causal problems on observational data, with more than two possible treatments to apply, are substantially more suitable for Reinforcement Learning algorithms than for any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerable gap in straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include several approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to multi-valued treatments. To the best of my knowledge it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment arity would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold deciding whether a treatment should be applied or not, would be an important next step in causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due to both the architectural design of the algorithm and the precision, outperforming the state-of-the-art, that it showed in the experiment run.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied over time frames them within time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers in machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela Van Der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf.

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822.

Atan, Onur, William R. Zame, and Mihaela Van Der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5.

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353-60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149.

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf.

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399-424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162.

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x.

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation.

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044.

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094.

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207-3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf.


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060.

Chipman, Hugh A., Edward I. George and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815.

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci.

Doroudi, Shayan, Philip S. Thomas and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf.

Dudik, Miroslav, John Langford and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601.

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela Van Der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O. et al. Nonlinear Causal Discovery with Additive Noise Models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803.

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www/.


Johansson, Fredrik D., Uri Shalit and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson and David Sontag (2017). Supplemental Materials for "Estimating Individual Treatment Effect: Generalization Bounds and Algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction. Complete draft. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.



2.0.2 The fundamental problem of causal analysis

The fundamental problem of causal analysis states that, given a unit x and assigning either the treatment t = 1 or t = 0 to that unit, it is impossible to observe the counterfactual outcome E[Y0 ∣ x, t = 1] or E[Y1 ∣ x, t = 0] (what would have happened, or what the outcome would have been, had the other treatment been given to the unit x).

However, it is always possible to observe the outcome of the treatment t that was effectively applied, which is represented as E[Y0 ∣ x, t = 0] or E[Y1 ∣ x, t = 1], or in shorter terms Y0 or Y1.

In this dissertation the focus is on the case where the causal graph is simple and known to be of the form (Y1, Y0) ← x → t, with no hidden confounders.

This problem can be extended, as can most of the problems and applications discussed under the Rubin-Neyman Potential Outcomes Framework in this dissertation, to a multi-treatment experiment. It is important to notice that the problem of not having access to the counterfactual outcome Ycf is even worse in multi-treatment experiments, since the number of missing values that matter for better Individual Treatment Effect estimates grows with the number of possible treatments, all except the one actually applied.

2.0.3 Metrics for Causality

Three well-known metrics from the causality field are reported for each machine learning technique applied.

The losses that will be reported are

• εITE: Error of the Individual Treatment Effect, also known as the Conditional Average Treatment Effect (CATE); it measures how well or how badly the treatment performs on one particular patient.

ITE(x) := E[y ∣ X = x, t = 1] − E[y ∣ X = x, t = 0] = E[Yx1 − Yx0]

• εATE: Error of the Average Treatment Effect; as its name describes, it represents the effect that the applied treatment (either t = 0 or t = 1) actually had on the whole population. Note that, being an average, it may not be the best criterion for treating a new patient, whose unique characteristics as a unit might lead to wrong results or no results at all.

ATE := E[ITE(x)] = E[δ] = E[Y1 − Y0], ∀x ∈ X

• εPEHE: Precision in Estimation of Heterogeneous Effect, used to measure the precision trade-off between the Individual Treatment Effect and the Average Treatment Effect. It is important to notice that this metric relates the ATE and ITE predictions, penalizing predictions that are right for one measure but wrong or less accurate for the other.

PEHE := (1/N) ∑_{i=1}^{N} ((y_{i1} − y_{i0}) − (ŷ_{i1} − ŷ_{i0}))²
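As a concrete illustration, these losses can be computed directly once the true and predicted potential outcomes are available. The sketch below uses small made-up arrays (all values are illustrative, not taken from the IHDP data):

```python
import numpy as np

# Toy example: y1, y0 are the true potential outcomes under treatment and
# control; y1_hat, y0_hat are a model's predictions (illustrative values only).
y1 = np.array([3.0, 2.5, 4.0, 3.5])
y0 = np.array([1.0, 2.0, 1.5, 2.5])
y1_hat = np.array([2.8, 2.7, 3.6, 3.4])
y0_hat = np.array([1.2, 1.8, 1.9, 2.4])

true_ite = y1 - y0          # individual treatment effects
pred_ite = y1_hat - y0_hat  # estimated individual treatment effects

# eps_ATE: absolute difference between true and estimated average effects.
eps_ate = np.abs(true_ite.mean() - pred_ite.mean())

# eps_PEHE: mean squared error between true and estimated individual effects.
eps_pehe = np.mean((true_ite - pred_ite) ** 2)

print(eps_ate, eps_pehe)
```

On this toy data the true ATE is 1.5 and the estimated ATE is 1.3, so εATE = 0.2, while εPEHE = 0.24.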


2.0.4 Assumptions

To work on the results, three important assumptions under the Rubin-Neyman causal framework shall be made:

• Consistency: For each unit just one of the two potential outcomes can be observed. Hence, if t = 0 then y = Y0 will be the observed (factual) outcome yF; if the applied treatment was t = 1, then y = Y1 will be the available observed (factual) outcome yF.

• Strong Ignorability: Also known as no unmeasured confounders, this assumption can be stated as (Y1, Y0) ⫫ t ∣ x and 0 < p(t = 1 ∣ x) < 1, ∀x. It is important to notice that, to be able to state this assumption, a domain knowledge expert would have to assess the dataset and determine that there are no unmeasured confounders. That is the case for the dataset used in this work.

• Common Support: This assumption states that for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 ∣ x) < 1

2.0.5 Definitions

In causal inference from observational data several terms are used interchangeably, which might confuse the reader.

This subsection should be clear before going further into this dissertation.

Some common synonyms are

• unit: the subject of the analysis, the one to which the treatment will be applied. Synonyms: patient, individual, input. Notation: xi, xi ∈ X.

• covariates: all the collected (observed) variables that have a direct effect on the outcome. Synonym: features (ML). Notation: x, x ∈ X.

• treatment: the possible actions that can be applied to a unit. Usually binary, but it can be multi-valued under the Rubin-Neyman Potential Outcomes Framework. Synonym: action. Notation: t, t ∈ {0, 1} or t ∈ {0, ..., N}.

• outcome: the measured result of applying a treatment t to a unit x. Synonyms: observed outcome, result, factual. Notation: Y, yf = yF.

• counterfactual: what the result would have been if the opposite of the actually applied treatment had been given to a unit. Synonyms: unobserved outcome. Notation: ycf, YCF.

2.0.6 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin 1978).


Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Athey and Imbens 2016; Hoiles and Van Der Schaar 2016; Bottou et al. 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Rubin 2005; Morgan and Winship 2014; Chernozhukov et al. 2016), whereas in machine learning the attention was caught no less than a decade ago (Lang 1995; Bottou et al. 2013; Swaminathan and Joachims 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Hoyer et al.; Maathuis et al. 2010; Triantafillou and Tsamardinos 2015; Mooij et al. 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric, and doubly robust methods.

Parametric methods for causal inference model the relationships among features, action pairs, and rewards by fitting one or more parameters, trying to explicitly model the relations between context, outcomes, and actions (treatments). Among these methods, linear and logistic regression (Prentice 1976; Gelman and Hill 2007), random forests (Wager and Athey 2015) and regression trees (Chipman, George, and McCulloch 2010) have been used in the past to complete the task. For example, (Wager and Athey 2017) estimate ITEs with causal forests, but their asymptotic estimates have limitations in datasets with a large number of relevant features that need to be addressed in future work.

In non-parametric approaches the counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan 2016; Austin 2011; Rosenbaum and Rubin 1983; Rosenbaum 2002). Doubly robust causal inference is performed by merging parametric and non-parametric methods (Dudik, Langford, and Li 2011; Jiang and Li 2015).

Doubly robust methods are known for merging the characteristics of both families. A common example of this approach is propensity-score-weighted regression (Bang and Robins 2005; Dudik, Langford, and Li 2011). When the treatment assignment probability is known, this method models the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational-data settings its efficiency drops dramatically (Kang and Schafer 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions, as well as the application of other techniques to causality, with special focus on datasets with unbalanced treatment assignment. This refers to the subarea known as causal inference from observational data. Observational data is data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom learned distance metrics and custom loss functions applied to Neural Networks has brought interesting advances to the scientific community (Shalit, Johansson, and Sontag 2017; Johansson, Shalit, and Sontag 2016a). (Tian et al. 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit, and Sontag 2016a; Shalit, Johansson, and Sontag 2017; Alaa, Weisz, and Van Der Schaar 2017) have made important contributions, whereas for Policy Optimization (Swaminathan and Joachims 2015a; Swaminathan and Joachims 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few, (Wager and Athey 2015), (Athey and Imbens 2016), (Shalit and Sontag 2016) and (Shalit, Johansson, and Sontag 2017). (Johansson, Shalit, and Sontag 2016a) and (Shalit, Johansson, and Sontag 2017) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error loss between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson, and Sontag 2017) the authors built on (Johansson, Shalit, and Sontag 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work was performed by implementing Gaussian processes (Alaa, Weisz, and Van Der Schaar 2017) and decision trees in different approaches (Hill 2011; Athey and Imbens 2016; Wager and Athey 2015).

Similarly, (Atan et al. 2016) face the problem of learning from biased data with many features by performing feature selection while predicting among multiple possible actions (outcomes); this is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning which features are relevant for predicting some actions while discarding them for others. The relevant feature selection was learned by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), similarly to (Tekin and Van Der Schaar 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and did not identify individual important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims 2015a) came up with a Counterfactual Risk Minimization (CRM) method, in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame, and Van Der Schaar 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al. 2007; Blitzer, McDonald, and Pereira 2006). Additional techniques on policy optimization were developed by (Beygelzimer and Langford 2008), in which the propensity scores need to be known, solving the selection bias through rejection sampling. The algorithm that (Atan, Zame, and Van Der Schaar 2018) introduce is based on domain adaptation (DA), as in (Gan et al. 2016). More work in the DA field was done by (Zhang et al. 2013; Daumé 2009).


To conclude, in cause and effect analysis, time series data is widely adopted for decision making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions, algorithms can learn rules to make decisions along time-steps (Nahum-Shani et al. 2012). An estimator on structural nested models was introduced by (Lok 2008). Furthermore, Bayesian posterior predictive distributions have been used to solve this time series causality task ("Causal Reasoning from Longitudinal Data"). Later, (Schulam and Saria) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that is set beforehand (Dudik, Langford, and Li 2011; Swaminathan and Joachims 2015a; Jiang and Li 2015; Paduraru et al. 2012; Doroudi, Thomas, and Brunskill 2017).

2.1 Machine Learning

This section describes the machine learning techniques applied in the experiments, implemented using the scikit-learn open source framework.

The vast majority of the methods tested belong to the Generalized Linear Models family, in which the target (label) value is represented as a linear combination of the covariates (inputs):

ŷ(w, x) = w0 + w1x1 + ... + wpxp    (2.1)

where the vector w = (w1, ..., wp) represents the coefficients and w0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between theobserved dataset and the predictions made on it

Mathematically it solves the problem of

min_w ∣∣Xw − y∣∣₂²

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the model produces high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.
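To make the objective concrete, the least-squares problem above can be solved with numpy's np.linalg.lstsq on a small synthetic dataset (the data-generating coefficients below are illustrative assumptions, not values from the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative design matrix with an intercept column and two covariates.
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.1 * rng.normal(size=n)

# Ordinary least squares: the w minimising ||Xw - y||_2^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # close to the generating coefficients [1.0, 2.0, -1.0]
```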


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the Linear Regression method mentioned above by penalizing the size of the coefficients. The loss turns into a problem of minimizing the penalized sum of squares:

min_w ∣∣Xw − y∣∣₂² + α∣∣w∣∣₂²

It is worth mentioning that the parameter α ≥ 0 controls the amount of robustness to collinearity that the trained model will have.
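A minimal sketch of the effect of the penalty, using the closed-form solution w = (XᵀX + αI)⁻¹Xᵀy on deliberately collinear synthetic data (all values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two nearly collinear covariates: OLS coefficients become unstable,
# while ridge shrinks them and stabilises the solution.
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + 0.05 * rng.normal(size=n)

def ridge(X, y, alpha):
    # Closed-form ridge solution: w = (X^T X + alpha I)^{-1} X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

w_ols = ridge(X, y, alpha=0.0)    # alpha = 0 reduces to ordinary least squares
w_ridge = ridge(X, y, alpha=1.0)

# OLS tends to split weight wildly between the collinear columns;
# ridge keeps both coefficients small, summing to roughly 1.
print(w_ols, w_ridge)
```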

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification to regression problems. During the training phase, the best possible solution is the one penalized least in total by a loss function. The support vectors will be the inputs that are either misclassified, classified within the margin, or on the edge of the generated hyper-plane that splits the dataset for future predictions.

In particular, an SVR takes training vectors xi ∈ Rᵖ, i = 1, ..., n, and a vector y ∈ Rⁿ. ε-SVR solves the following primal problem:

min_{w,b,ζ,ζ*}  (1/2) wᵀw + C ∑_{i=1}^{n} (ζi + ζi*)

subject to  yi − wᵀφ(xi) − b ≤ ε + ζi,  wᵀφ(xi) + b − yi ≤ ε + ζi*,  ζi, ζi* ≥ 0

where e is the vector of all ones, C > 0 is the upper bound, and Q is an n by n positive semidefinite matrix with Qij ≡ K(xi, xj) = φ(xi)ᵀφ(xj) the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function φ.

The decision function is:

∑_{i=1}^{n} (αi − αi*) K(xi, x) + ρ

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than Linear Regression.

This technique builds a probabilistic model of the regression problem, with the prior over the parameter w of the general Bayesian Regression solver given by a spherical Gaussian:

p(w ∣ λ) = N(w ∣ 0, λ⁻¹ Ip)

The scikit-learn defaults are used to train the model: α1 = α2 = λ1 = λ2 = 10⁻⁶.

During the model fitting process, the parameters w, α and λ are estimated jointly.


2.1.5 Lasso

Lasso Regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize:

min_w  (1 / (2 n_samples)) ∣∣Xw − y∣∣₂² + α∣∣w∣∣₁

This method minimizes the least-squares penalty with α∣∣w∣∣₁ added, where α is a constant and ∣∣w∣∣₁ is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm retrieves sparse models, which may be helpful for feature selection.
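The sparsity-inducing behaviour can be illustrated with a plain cyclic coordinate-descent solver for the objective above, written from scratch on synthetic data (a didactic sketch, not the solver scikit-learn actually uses; the data and α are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data where only the first covariate matters; the l1 penalty
# should drive the irrelevant coefficients exactly to zero.
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=n)

def soft_threshold(rho, alpha):
    # Proximal operator of the l1 norm.
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    # Cyclic coordinate descent for (1/(2n))||Xw - y||_2^2 + alpha ||w||_1
    n, p = X.shape
    w = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with coordinate j removed.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha) / z[j]
    return w

w = lasso_cd(X, y, alpha=0.5)
print(w)  # only w[0] remains non-zero
```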

2.1.6 Lasso Lars

This model is trained with Least Angle Regression (LARS), with L1 regularization applied.

The objective function is determined by:

(1 / (2 n_samples)) ∣∣y − Xw∣∣₂² + α∣∣w∣∣₁

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w. It drops the assumption of the Gaussian being spherical, making it elliptical instead.

Mathematically:

p(w ∣ λ) = N(w ∣ 0, A⁻¹)

with diag(A) = λ = {λ1, ..., λp}.

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, these methods do not require a learning rate, but they do require a regularization parameter C.

It can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (squared epsilon-insensitive).

2.1.9 Theil Sen Regressor

It is especially suited to multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems. In that regime the method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the n nearest neighbors found during the training phase. It is important to notice that n is defined by the user and will positively or negatively affect the obtained predictions.
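A minimal version of the idea, with a hand-written Euclidean k-nearest-neighbours predictor on toy one-dimensional data (both the data and the choice k = 3 are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Predict the mean outcome of the k training points closest to x.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()

X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])

pred = knn_predict(X_train, y_train, np.array([1.2]), k=3)
print(pred)  # averages the three closest outcomes: (0 + 1 + 2) / 3 = 1.0
```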

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, and it can be extended to predict more than two classes using the logistic function.

This scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets and the results willbe discussed in the Experiments section


Chapter 3

Methodology

This chapter details what was attempted and explains the methods that were used, along with any other necessary information to help the reader follow the flow of the experiments covered later.

In addition, the dataset that was used will be presented, closing with a section about other possible datasets and the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments will be covered in detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of the latest efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients, or inputs (in machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Since the already collected, observational data has not been properly randomized, it does not come from the same probability distribution; moreover, the number of units which received the treatment can differ substantially from the number which did not.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, would be completely unethical for clear reasons.

To address these limitations when working on causal effects from observational data, synthetic, semi-synthetic, or toy datasets are created by researchers to establish a good starting point and a benchmark framework for trying, testing, or developing better algorithms able to make more accurate predictions, surpassing state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions for causal inference. The most common one is that in which the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases a policy risk function π is designed to apply the treatment t or not, depending on a certain threshold θ. The main goal when iterating over different values of the threshold on the trained dataset is to make as few errors as possible when predicting the application of the active treatment or control.
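To picture the thresholding mechanism, the sketch below computes, on made-up effect estimates, how often a threshold policy π(x) = 1 if the predicted effect exceeds θ disagrees with the decision an oracle knowing the true effects would make. This disagreement rate is only an illustration of the idea, not the exact policy risk definition used in the literature:

```python
import numpy as np

# Illustrative predicted and true individual effects for four units.
pred_effect = np.array([0.8, -0.2, 0.5, 0.1])
true_effect = np.array([1.0, -0.5, 0.3, -0.1])

def policy_error(pred_effect, true_effect, theta):
    # Fraction of units where the thresholded decision disagrees with
    # the decision an oracle with the true effects would make.
    decision = pred_effect > theta
    oracle = true_effect > theta
    return np.mean(decision != oracle)

# Sweep the threshold, as described above.
print([policy_error(pred_effect, true_effect, t) for t in (0.0, 0.4)])
```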

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment aimed at reducing the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and integration at a dedicated child development center in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group received only the pediatric follow-up.

However, (Hill, 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the paragraph above. In (Hill, 2011) some continuous and binary covariates from this real-life RCT were selected. Making use of these covariates, (Hill, 2011) generated non-parametric simulated outcomes for the whole population of the trial; 25 covariates of the whole study were taken for this dataset creation. The author then introduced an artificial imbalance between the control and treatment individuals by removing a subset of the treated population. Finally, the dataset comprises 747 subjects (units, or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 and t = 1 given the generalization task that an algorithm must perform.

Along with the covariates, the simulated causal information can be observed for each unit: the effectively applied treatment (t = 0 or t = 1), the observed factual outcome (yF), the counterfactual outcome (yCF), and the average outcomes with noise, mu0 and mu1.
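The layout of these replication files can be illustrated with a toy example; the keys `'x'`, `'t'`, `'yf'`, `'ycf'`, `'mu0'` and `'mu1'`, with the last axis indexing the replication, are assumptions based on the files distributed with the compared papers' code:

```python
import io
import numpy as np

# Build a toy .npz with the same layout as the IHDP replication files
n_units, n_cov, n_reps = 747, 25, 10
rng = np.random.RandomState(0)
buf = io.BytesIO()
np.savez(buf,
         x=rng.randn(n_units, n_cov, n_reps),     # covariates
         t=rng.randint(0, 2, (n_units, n_reps)),  # applied treatment
         yf=rng.randn(n_units, n_reps),           # factual outcome
         ycf=rng.randn(n_units, n_reps),          # counterfactual outcome
         mu0=rng.randn(n_units, n_reps),          # noiseless control outcome
         mu1=rng.randn(n_units, n_reps))          # noiseless treated outcome
buf.seek(0)

data = np.load(buf)
rep = 0                                           # select one replication
X, t = data['x'][:, :, rep], data['t'][:, rep]
y_f, y_cf = data['yf'][:, rep], data['ycf'][:, rep]
print(X.shape, t.shape)                            # (747, 25) (747,)
```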

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting B in the NPCI package (Dorie, 2016). The replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which constitute the state-of-the-art baseline chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analysing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE and PEHE errors, for the reader to look further into them if desired. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, a different number of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b) the IHDP dataset (Hill, 2011) was run on 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in that tool. These results are not shown in this dissertation, since the response surface chosen differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). To implement the BART results, (Johansson, Shalit, and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also referenced and ran experiments on the IHDP dataset (Hill, 2011). However, the authors do not state the number of replications used to gather the metrics, nor do they specify whether log-linear setting A or B, or any other method, was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared to this dissertation's results and are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset were performed to ultimately predict the factual (yF) and counterfactual (yCF) outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).
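Under these definitions, the three evaluation errors can be sketched as follows (this mirrors the usual definitions; the evaluation code of (Louizos et al., 2017) may differ in minor details):

```python
import numpy as np

def causal_errors(y0_pred, y1_pred, t, y_f, y_cf, mu0, mu1):
    """Return eps_ITE, eps_ATE and sqrt(eps_PEHE) for one replication."""
    ite_pred = y1_pred - y0_pred
    # per-unit effect recovered from the (noisy) factual and counterfactual
    # outcomes: always y1 - y0, regardless of which one was observed
    ite_noisy = np.where(t == 1, y_f - y_cf, y_cf - y_f)
    eps_ite = np.sqrt(np.mean((ite_pred - ite_noisy) ** 2))
    ite_true = mu1 - mu0                      # noiseless simulated effect
    eps_ate = np.abs(np.mean(ite_pred) - np.mean(ite_true))
    eps_pehe = np.sqrt(np.mean((ite_pred - ite_true) ** 2))
    return eps_ite, eps_ate, eps_pehe

# sanity check: perfect, noise-free predictions give zero error everywhere
t = np.array([1, 0])
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
y_f, y_cf = np.array([1.0, 0.0]), np.array([0.0, 2.0])
errs = causal_errors(mu0, mu1, t, y_f, y_cf, mu0, mu1)
```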

This is a challenging goal since, as the reader should be clear on by this point, neither the counterfactual outcome yCF nor the average treatment outcomes with noise, mu0 and mu1, can be used at all to train the regressor models. Instead, these values, along with the factual outcome yF, are only used to compute the εITE, εATE and εPEHE errors.

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson, and Sontag, 2017), the same technique later followed by (Louizos et al., 2017) to perform, compare and show their results.

Within-sample: this refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already trained model against the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples in which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for an individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data and were mentioned earlier in this work.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0; predictions are then made again setting all the values of the treatment to t = 1. The subtraction between these two predictions for each input is known as the ITE and ultimately defines whether the patient would benefit from applying the treatment. Mathematically it is represented by E[Y1 − Y0 | x].
The machine learning algorithms, implemented in Python as provided by (User guide: contents — scikit-learn 0.19.2 documentation), were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. Hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
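The two-prediction procedure just described can be sketched with scikit-learn on toy data (any regressor from the tables could replace LinearRegression; the data and the simulated effect of 2.0 are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n, d = 200, 25
X = rng.randn(n, d)
t = rng.randint(0, 2, n)
y_f = X[:, 0] + 2.0 * t + 0.1 * rng.randn(n)   # factual outcome, true effect 2.0

# Train a single regressor on the covariates plus the treatment indicator
model = LinearRegression().fit(np.column_stack([X, t]), y_f)

# Predict both potential outcomes by forcing t = 0 and then t = 1
y0_hat = model.predict(np.column_stack([X, np.zeros(n)]))
y1_hat = model.predict(np.column_stack([X, np.ones(n)]))

ite_hat = y1_hat - y0_hat        # per-unit estimate of E[Y1 - Y0 | x]
ate_hat = ite_hat.mean()         # recovers a value near the simulated 2.0
```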

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1 IHDP 10 replications with traditional machine learning algorithms - Within sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is worth noting that the split between training and testing was performed randomly, over the training dataset only. The intention was to check whether the results


TABLE 4.2 IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), were similar or differed dramatically.

TABLE 4.3 IHDP 100 replications - Within sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4 IHDP 100 replications - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, hyperparameter tuning, where performed, was done on this number of replications.

TABLE 4.5 IHDP 100 replications, already split dataset - Within sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6 IHDP 100 replications, already split dataset - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with log-linear response setting A generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four different tables can be observed for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results of the methods presented in the following section.
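A minimal sketch of the scaling step, fitting the scaler on the training split only so that the test data reuses the training minima and maxima (the toy arrays are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.5, 15.0]])

scaler = MinMaxScaler()                  # maps each training feature to [0, 1]
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the training min/max
print(X_train_s[:, 0])                   # [0.  0.5 1. ]
```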



TABLE 4.7 IHDP 1000 replications - No scaling - Within sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8 IHDP 1000 replications - No scaling - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9 IHDP 1000 replications - Scaled - Within sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10 IHDP 1000 replications - Scaled - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Next, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is well below that of the regressors, the main reason being that when encoding the target values to assign them a probability, these are not the same values that need to be predicted; precision is also lost when decoding the predictions. The l2 norm was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
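A sketch of this encode-train-decode cycle on invented data (the quartile binning and per-bin-mean decoding are illustrative assumptions; with the newton-cg solver scikit-learn optimises a multinomial l2-penalised loss):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = X[:, 0] + 0.1 * rng.randn(300)          # continuous outcome

# Encode: discretise the target into quartile bins (classes 0..3)
bins = np.quantile(y, [0.25, 0.5, 0.75])
y_cls = np.digitize(y, bins)

clf = LogisticRegression(solver='newton-cg', penalty='l2',
                         max_iter=200).fit(X, y_cls)

# Decode: map predicted classes back to values via per-bin means;
# precision is lost in both directions, which hurts the error metrics
centers = np.array([y[y_cls == k].mean() for k in range(4)])
y_hat = centers[clf.predict(X)]
```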

TABLE 4.11 IHDP 100 replications, logistic regressions - Within sample

                                       εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)    7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)        7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12 IHDP 100 replications, logistic regressions - Out-of-sample

                                       εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)    5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)        5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


Across all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyperparameter tuning were performed. The errors observed were even smaller, so the final hyperparameters selected for this dataset were a Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyperparameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyperparameter selection, with the final choices highlighted by their errors.
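A grid search over the same kind of values can be sketched as follows (toy data; only the rbf kernel with C=1e3 and several gammas, as in the tables):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.randn(150, 10)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(150)

# Grid mirroring the values explored in the tables: rbf kernel, fixed C,
# and gammas spanning several orders of magnitude
param_grid = {'kernel': ['rbf'], 'C': [1e3],
              'gamma': [0.1, 0.05, 0.01, 0.001]}
search = GridSearchCV(SVR(), param_grid, cv=3,
                      scoring='neg_mean_squared_error').fit(X, y)
best = search.best_params_       # the IHDP tables favoured gamma = 0.01
```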

TABLE 4.13 IHDP 100 replications, SVR hyperparameter tuning - Within sample

                          εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14 IHDP 100 replications, SVR hyperparameter tuning - Out-of-sample

                          εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39

Finally, the results obtained by this thesis and the experiments run are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .73 ± .04
OLS/LR-2    2.4 ± .1    .14 ± .01
BLR         5.8 ± .3    .72 ± .04
k-NN        2.1 ± .1    .14 ± .01
TMLE        5.0 ± .2    .30 ± .01
BART        2.1 ± .1    .23 ± .01
RANDFOR     4.2 ± .2    .73 ± .05
CAUSFOR     3.8 ± .2    .18 ± .01
BNN         2.2 ± .1    .37 ± .03
TARNET      .88 ± .0    .26 ± .01
CFR MMD     .73 ± .0    .30 ± .01
CFR WASS    .71 ± .0    .25 ± .01

Within sample IHDP 1000 replications

TABLE 4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .94 ± .06
OLS/LR-2    2.5 ± .1    .31 ± .02
BLR         5.8 ± .3    .93 ± .05
k-NN        4.1 ± .2    .79 ± .05
BART        2.3 ± .1    .34 ± .02
RANDFOR     6.6 ± .3    .96 ± .06
CAUSFOR     3.8 ± .2    .40 ± .03
BNN         2.1 ± .1    .42 ± .03
TARNET      .95 ± .0    .28 ± .01
CFR MMD     .78 ± .0    .31 ± .01
CFR WASS    .76 ± .0    .27 ± .01

Out-of-sample IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications on powerful feature selection methods were presented in the Related Work section, an experiment was also performed in the developed code that applies Recursive Feature Elimination (RFE) using the scikit-learn library.

In addition, assuming Strong Ignorability on the studied dataset, it would not strictly be appropriate to perform such an experiment; but given the nature of the machine learning regressors and their sensitivity to highly correlated input features, it might reduce some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; thus they are not shown, but they can be reproduced by the reader with the code implementation for further analysis.
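The RFE experiment can be sketched as follows (toy data; the base estimator and the number of selected features are illustrative assumptions, not the exact configuration in the developed code):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 25)                        # 25 covariates, as in IHDP
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.randn(100)

# Recursively drop the weakest covariates until 10 remain
selector = RFE(LinearRegression(), n_features_to_select=10).fit(X, y)
X_reduced = selector.transform(X)             # shape (100, 10)
```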

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications in this work; the uploaded code contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine. On CPU, the estimated finishing time was about four and a half days with an Intel Dual Core i7.

Domain Adaptation algorithms are a promising field for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17 Domain Adaptation Neural Networks

                        εITE           εATE           √εPEHE
DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and Out-of-sample IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the machine learning regressor algorithms applied in this dissertation obtain results very close to those published by the cited and compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the unbalanced treated dataset) or any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.


It seems that an excessive amount of effort from the authors leads to complicated methods without gaining much in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Network on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying machine learning regressors out-of-the-box, with significantly less or no added complexity, to predict the factual and counterfactual outcomes for a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state-of-the-art results.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is one of the must-do tasks when using machine learning algorithms) were required to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, performing predictions of both the factual and counterfactual outcomes to later present the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, though it evolved once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, continuous-space causal problems on observational data that include more than two possible treatments are substantially more suitable to solve with Reinforcement Learning algorithms than with any other Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include several different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.
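Extending the two-prediction mechanic to a multi-valued treatment could look like this sketch (toy data; the treatment levels and effect size are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n, d, n_treatments = 300, 10, 3
X = rng.randn(n, d)
t = rng.randint(0, n_treatments, n)
y = X[:, 0] + 1.5 * t + 0.1 * rng.randn(n)    # outcome grows with the level

model = RandomForestRegressor(random_state=0).fit(np.column_stack([X, t]), y)

# Score every candidate treatment level for every unit, then recommend
# the level with the best predicted outcome
preds = np.stack([model.predict(np.column_stack([X, np.full(n, k)]))
                  for k in range(n_treatments)], axis=1)
best_treatment = preds.argmax(axis=1)          # per-unit recommendation
```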

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold deciding whether a treatment should be applied or not, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due to both the architectural design of the algorithm and the precision, surpassing the state-of-the-art, observed in the experiment run here.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the applied treatments is framed within time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers in machine learning applied to treatments administered over time.


Bibliography

Alaa Ahmed M Michael Weisz and Mihaela Van Der Schaar (2017) Deep Counter-factual Networks with Propensity-Dropout Tech rep arXiv arXiv170605966v1URL httpsarxivorgpdf170605966pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf

32 Bibliography

Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060.

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815.

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci.

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf.

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601.

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985–1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela Van Der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML '16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O. et al. Nonlinear Causal Discovery with Additive Noise Models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803.

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www.


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.ncbi.nlm.nih.gov/pubmed/20354511.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373. URL: http://www.ncbi.nlm.nih.gov/pubmed/23025434.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for "Estimating Individual Treatment Effect: Generalization Bounds and Algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (complete draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.

Chapter 2 Background 7

2.0.4 Assumptions

To derive the results, three important assumptions of the Neyman-Rubin causal framework are made:

• Consistency: for each unit only one of the two potential outcomes can be observed. If the applied treatment was t = 0, then y = Y0 is the observed (factual) outcome yF; if it was t = 1, then y = Y1 is the observed factual outcome yF.

• Strong Ignorability: also known as no unmeasured confounders, this assumption states that (Y1, Y0) ⊥ t | x and 0 < p(t = 1 | x) < 1 for all x. It is important to notice that to assert this assumption a domain expert has to assess the dataset and determine that no confounders are left unmeasured; that is the case for the dataset used in this work.

• Common Support: for each unit x ∈ X there is a positive probability of being both treated (t = 1) and untreated (t = 0):

0 < P(t = 1 | x) < 1

2.0.5 Definitions

In causal inference from observational data several terms are used interchangeably and might confuse the reader; the following vocabulary should be clear before going further into this dissertation.

Some common synonyms are:

• unit: the subject of the analysis, the one the treatment is applied to; patient, individual, input; x_i ∈ X.

• covariates: all the collected (observed) variables that have a direct effect on the outcome; features (ML); x ∈ X.

• treatment: the possible actions that can be applied to a unit, usually binary but possibly multi-valued under the Neyman-Rubin potential outcomes framework; action; t ∈ {0, 1} or t ∈ {0, ..., N}.

• outcome: the measured result of applying a treatment t to a unit x; observed outcome, result, factual; y, yF.

• counterfactual: the result that would have been observed if the opposite of the effectively applied treatment had been given to the unit; unobserved outcome; yCF.
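The vocabulary above can be made concrete with a small sketch. The `Unit` container below is purely illustrative (it is not part of any library or of the thesis code); it simply maps the terms onto fields:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Unit:
    """One subject of the study; a hypothetical illustration of the vocabulary."""
    x: np.ndarray                       # covariates (features) of the unit
    t: int                              # applied treatment, t in {0, 1}
    y_factual: float                    # observed (factual) outcome yF
    y_cfactual: Optional[float] = None  # counterfactual yCF, unobserved in practice


# in real observational data only the factual outcome is ever recorded
unit = Unit(x=np.array([0.1, 2.3, -1.0]), t=1, y_factual=4.2)
```

That `y_cfactual` stays `None` for every real unit is exactly the fundamental problem of causal analysis discussed earlier.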

2.0.6 Related Work

Potential outcomes are the framework used to mathematically describe causality and counterfactuals (Rubin 1978).


Causality from observational data is applicable to a wide range of industries, e.g. advertisement placement, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar 2016; Bottou et al. 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship 2014; Chernozhukov et al. 2016), whereas in machine learning it has attracted attention for little more than a decade (Lang 1995; Bottou et al. 2013; Swaminathan and Joachims 2015a). A lot of work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al. 2010; Triantafillou and Tsamardinos 2015; Mooij et al. 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods model the relationships between feature-action pairs and rewards with one or more parameters, trying to specifically capture the relations within context, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice 1976; Gelman and Hill 2007), random forests (Wager and Athey 2015) and regression trees (Chipman, George and McCulloch 2010) have been used in the past for the task. For example, (Wager and Athey 2017) estimate ITEs with causal forests, but their asymptotic estimates have limitations in datasets with a large number of relevant features that need to be addressed in future work.

In non-parametric approaches the counterfactual predictions are mostly computed through propensity score matching and re-weighting (Joachims and Swaminathan 2016; Austin 2011; Rosenbaum and Rubin 1983; Rosenbaum 2002). Doubly robust causal inference is performed by merging parametric and non-parametric methods (Dudik, Langford and Li 2011; Jiang and Li 2015).

Doubly robust methods are known for combining the characteristics of both families; a common example is propensity-score-weighted regression (Bang and Robins 2005; Dudik, Langford and Li 2011). When the treatment assignment probability is known, these methods model the problem particularly well, e.g. in off-policy evaluation or learning from bandits; however, in most observational-data settings their efficiency drops dramatically (Kang and Schafer 2007).

Machine learning for predicting individual treatment effects has attracted a lot of interest during the last two years, through the development of custom metric functions, as well as the application of other techniques to causality, with special focus on datasets with unbalanced treatment assignment. This is the subarea known as causal inference from observational data, observational data being data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance-learning metrics and custom loss functions applied to neural networks has brought interesting advances to the scientific community (Shalit, Johansson and Sontag 2017; "Learning Representations for Counterfactual Inference"). (Tian et al. 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the individual treatment effect, (Johansson, Shalit and Sontag 2016a; Shalit, Johansson and Sontag 2017; Alaa, Weisz and Van Der Schaar 2017) have made important contributions, whereas for policy optimization (Swaminathan and Joachims 2015a; Swaminathan and Joachims 2015b) can be consulted. In policy optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few, (Wager and Athey 2015), (Athey and Imbens 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag 2016) and (Shalit, Johansson and Sontag 2017). (Johansson, Shalit and Sontag 2016a) and (Shalit, Johansson and Sontag 2017) worked on learning balanced representations, using neural networks both to learn better predictions of the factual outcome and to minimize the error between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag 2017) the authors built on (Johansson, Shalit and Sontag 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. For ITE prediction, other work implemented Gaussian processes (Alaa, Weisz and Van Der Schaar 2017) and decision trees in different approaches (Hill 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey 2015).

Similarly, (Atan et al. 2016) face the problem of learning from biased data with many features by performing feature selection while predicting among multiple possible actions (outcomes), a more challenging setting but one that models actual industry problems more closely. The authors also remark on the difficulty of learning features that are relevant for predicting some actions while not taking them into account for others. The relevant-feature learning was done by implementing a form of online Contextual Multi-Armed Bandit (CMAB), as in (Tekin and Van Der Schaar 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and they do not identify individual important features to perform the task.

In terms of policy optimization methods, (Swaminathan and Joachims 2015a) came up with a Counterfactual Risk Minimization (CRM) method, in which they minimize the inverse propensity score of the units by introducing an algorithm named POEM. After that, (Atan, Zame and Van Der Schaar 2018) proposed to address the selection bias by learning representations, work closely related to the domain adaptation bounds in (Ben-David et al. 2007; Blitzer, McDonald and Pereira 2006). Additional policy optimization techniques were developed by (Beygelzimer and Langford 2008), in which the propensity scores need to be known and the selection bias is solved through rejection measurements. The algorithm that (Atan, Zame and Van Der Schaar 2018) introduce is based on domain adaptation (DA), as in (Gan et al. 2016); more work in the DA field was done by (Zhang et al. 2013; Daumé 2009).


To conclude, in cause-and-effect analysis time series data is widely used for decision-making support. The main challenge in the continuous-time setting is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins 1986) was the first to learn and optimize decisions over time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al. 2012), algorithms can learn rules to make decisions along time steps. An estimator on structural nested models was introduced by (Lok 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time-series causality task. Later, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a reinforcement learning approach (Sutton and Barto 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used in off-policy learning to estimate the best expected reward of a policy that was set beforehand (Dudik, Langford and Li 2011; Swaminathan and Joachims 2015a; Jiang and Li 2015; Paduraru et al. 2012; Doroudi, Thomas and Brunskill 2017).

2.1 Machine Learning

In this section the machine learning techniques applied in the experiments, implemented with the scikit-learn open-source framework, are described.

The vast majority of the tested methods belong to the family of generalized linear models, in which the target (label) value is represented as a linear combination of the covariates (inputs):

y(w, x) = w0 + w1 x1 + ... + wp xp    (2.1)

where the vector w = (w1, ..., wp) holds the coefficients and w0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it. Mathematically, it solves the problem

min_w ||Xw − y||_2^2

The main limitation of this method is that if the features (covariates) are approximately linearly dependent, the model has high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.
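A minimal scikit-learn sketch of this estimator on synthetic data (not the thesis's IHDP pipeline) shows the coefficients and intercept being recovered from noisy observations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 3)                          # covariates
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 3.0 + 0.01 * rng.randn(200)   # target with small noise

# fit minimizes ||Xw - y||^2 over w (and the intercept w0)
ols = LinearRegression().fit(X, y)
print(ols.coef_, ols.intercept_)               # close to w_true and 3.0
```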


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned linear regression method by penalizing the size of the coefficients; the loss becomes a penalized sum of squares:

min_w ||Xw − y||_2^2 + α ||w||_2^2

The parameter α ≥ 0 controls the amount of shrinkage, and therefore how robust to collinearity the trained model is going to be.
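The effect of α can be seen in a small sketch (synthetic data, not from the thesis): with a near-collinear design, a larger α shrinks the coefficient vector more strongly.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
base = rng.randn(100, 2)
# third column is almost a copy of the first -> near-collinear design
X = np.hstack([base, base[:, :1] + 1e-3 * rng.randn(100, 1)])
y = X[:, 0] + X[:, 1] + 0.1 * rng.randn(100)

weak = Ridge(alpha=1e-4).fit(X, y)    # almost ordinary least squares
strong = Ridge(alpha=10.0).fit(X, y)  # heavier shrinkage of the coefficients
print(np.linalg.norm(weak.coef_), np.linalg.norm(strong.coef_))
```

The norm of the coefficient vector is non-increasing in α, which is the stabilizing behavior described above.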

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification to regression problems. During the training phase, the solution least penalized in total by the loss function is chosen; the support vectors are the training inputs that either fall outside the ε-margin or lie exactly on the edge of the tube around the fitted function, and they determine future predictions.

In particular, given training vectors x_i ∈ R^p, i = 1, ..., n, and a target vector y ∈ R^n, ε-SVR solves the following primal problem:

min_{w, b, ζ, ζ*}  (1/2) w^T w + C Σ_{i=1}^n (ζ_i + ζ_i*)

where C > 0 is the penalty parameter (the upper bound in the dual) and ζ_i, ζ_i* are the slack variables. In the dual formulation, Q is an n-by-n positive semidefinite matrix with Q_ij ≡ K(x_i, x_j) = φ(x_i)^T φ(x_j), the kernel; training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is

Σ_{i=1}^n (α_i − α_i*) K(x_i, x) + ρ
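A short sketch with scikit-learn (synthetic one-dimensional data, not from the thesis) showing ε-SVR with an RBF kernel K(x_i, x_j) fitting a nonlinear target:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(120, 1)), axis=0)
y = np.sin(X).ravel() + 0.05 * rng.randn(120)

# C is the penalty parameter of the primal problem; epsilon sets the tube width
svr = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X, y)
pred = svr.predict(X)
print(np.abs(pred - np.sin(X).ravel()).mean())  # small average error vs the true curve
```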

2.1.4 Bayesian Ridge

Bayesian ridge regression is more robust to ill-posed problems than ordinary linear regression. It builds a probabilistic model of the regression problem in which the prior over the parameter w of the general Bayesian regression solver is a spherical Gaussian:

p(w | λ) = N(w | 0, λ^{-1} I_p)

The scikit-learn defaults are used to train the model: α1 = α2 = λ1 = λ2 = 10^{-6}. During the fitting process the parameters w, α and λ are estimated jointly.
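A brief sketch (synthetic data, not the thesis's experiments) illustrating the joint estimation and the posterior predictive uncertainty this model provides:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.RandomState(0)
X = rng.randn(150, 3)
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.randn(150)

# w, alpha (noise precision) and lambda (weight precision) are estimated jointly;
# the hyperpriors default to alpha_1 = alpha_2 = lambda_1 = lambda_2 = 1e-6
br = BayesianRidge().fit(X, y)
mean, std = br.predict(X[:5], return_std=True)  # posterior predictive mean and std
print(mean, std)
```

Unlike plain ridge, the fitted model also yields a predictive standard deviation per unit.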


2.1.5 Lasso

Lasso regression is a linear model fitted with an ℓ1 prior as regularizer. Its objective is to minimize

min_w  1/(2 n_samples) ||Xw − y||_2^2 + α ||w||_1

i.e. the least-squares loss with the penalty α||w||_1 added, where α is a constant and ||w||_1 is the ℓ1-norm of the parameter vector.

It is important to notice that this algorithm yields sparse models, which may be helpful for feature selection.
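The sparsity-driven feature selection can be sketched on synthetic data (not from the thesis): with only two truly relevant covariates, the ℓ1 penalty zeroes out the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)  # only features 0 and 1 matter

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # indices of non-zero coefficients
print(selected)                          # the irrelevant coefficients are driven to zero
```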

2.1.6 Lasso Lars

This model is the Lasso fitted with the Least Angle Regression (LARS) algorithm; the same ℓ1 regularization is applied. The objective function is

1/(2 n_samples) ||y − Xw||_2^2 + α ||w||_1

2.1.7 ARD Regression

Although this method is similar to Bayesian ridge, it may lead to sparser weights w. It drops the assumption that the Gaussian prior is spherical, making it elliptical instead. Mathematically,

p(w | λ) = N(w | 0, A^{-1})

with diag(A) = λ = {λ1, ..., λp}.
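A sketch on synthetic data (not from the thesis) showing the pruning effect of having one precision λ_i per weight: irrelevant weights are driven towards zero.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(0)
X = rng.randn(150, 6)
y = 1.5 * X[:, 0] - 2.5 * X[:, 3] + 0.1 * rng.randn(150)  # only features 0 and 3 matter

# one precision per weight (diagonal A), so irrelevant weights get pruned
ard = ARDRegression().fit(X, y)
print(np.round(ard.coef_, 2))   # near zero everywhere except features 0 and 3
```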

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, passive-aggressive methods do not require a learning rate, but they do require a regularization parameter C.

The regressor can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (squared epsilon-insensitive).
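A small sketch (synthetic, noiseless data, not from the thesis) with the scikit-learn implementation, where `C` sets the aggressiveness of the updates and the loss name selects PA-I or PA-II:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

rng = np.random.RandomState(0)
X = rng.randn(500, 3)
y = X @ np.array([1.0, 2.0, -1.0])   # exactly linear target

# "epsilon_insensitive" is PA-I; "squared_epsilon_insensitive" is PA-II
par = PassiveAggressiveRegressor(C=1.0, loss="epsilon_insensitive",
                                 max_iter=1000, random_state=0).fit(X, y)
print(par.coef_)   # close to the generating weights on this realizable problem
```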

2.1.9 Theil Sen Regressor

It is especially suited to data with multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems; in high dimensions the method becomes similar to ordinary least squares linear regression.
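The robustness to outliers can be sketched on synthetic data (not from the thesis): with 10% high-leverage corrupted units, the median-based Theil-Sen slope stays near the true value while ordinary least squares is pulled away.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, TheilSenRegressor

rng = np.random.RandomState(0)
X = rng.randn(100, 1)
y = 2.0 * X.ravel() + 0.1 * rng.randn(100)
X[:10] = 4.0       # 10% of the units become high-leverage outliers
y[:10] = -20.0

ts = TheilSenRegressor(random_state=0).fit(X, y)
ols = LinearRegression().fit(X, y)
print(ts.coef_[0], ols.coef_[0])   # Theil-Sen stays near the true slope 2.0
```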


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the k nearest neighbors found during the training phase. The number of neighbors k is defined by the user, and its choice will affect, positively or negatively, the quality of the predictions.
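A tiny sketch (toy data, not from the thesis) making the prediction rule explicit: with k = 2, the prediction is the mean target of the two nearest training units.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
# for the query 1.6, the nearest units are x = 2.0 and x = 1.0,
# so the prediction is (2.0 + 1.0) / 2 = 1.5
print(knn.predict([[1.6]]))
```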

2.1.11 Logistic Regression

Despite its name, logistic regression is a classification method: it models class probabilities through the logistic function, and it extends to more than two classes in its multinomial form.

The scikit-learn implementation can fit binary, one-vs-rest or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets, and the results are discussed in the Experiments chapter.
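One of the solver/penalty combinations can be sketched on synthetic data (the label here is an illustrative treatment-assignment indicator, not the thesis's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
t = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary treatment label

# L2 regularization with the lbfgs solver, one of several tried combinations
clf = LogisticRegression(penalty="l2", solver="lbfgs").fit(X, t)
proba = clf.predict_proba(X[:3])          # class probabilities from the logistic function
print(proba)
```

Note that predicted probabilities of this form, p(t = 1 | x), are exactly what propensity-score methods estimate.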


Chapter 3

Methodology

This chapter details what this work set out to achieve and explains the methods that were used, together with any other information needed for the reader to follow the flow of the experiments covered later.

It also gives a full description of the dataset used to perform the experiments, closing with a section on other possible datasets and on the limitations of the ones used in this dissertation.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of recent efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients or inputs (in machine learning vocabulary) that was collected without the chance of previously designing a carefully planned Randomized Controlled Trial. By its nature, such observational data has not been randomized properly, nor does it come from a single probability distribution; moreover, the number of units that received the treatment and the number that did not can differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they are unethical or impossible to perform. For example, an experiment testing whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, is completely unethical for clear reasons.

To overcome these limitations when working on causal effects in observational data, synthetic, semi-synthetic or toy datasets are created by researchers to establish a good starting point and a benchmark framework on which to try, test and develop algorithms that make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of prediction settings in causal inference. In the most common one, the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In such cases a policy risk function π is designed that applies the treatment t depending on a certain threshold θ; the main goal when iterating over values of the threshold for the trained dataset is to minimize the errors made when deciding between the active treatment and control.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross 1993) was a Randomized Controlled Trial (RCT) held across multiple sites in the United States, applying control and treatment to reduce the developmental and health problems of low-birth-weight premature infants. The treated group received home visits and attendance at a dedicated child development center in addition to a pediatric follow-up, which can be described as high-quality child care; the control group only received the pediatric follow-up.

(Hill 2011) presented a semi-synthetic (also called semi-simulated) dataset derived directly from this original IHDP RCT (Gross 1993). Hill selected continuous and binary covariates from the real-life trial, 25 in total, and used them to generate non-parametric simulated outcomes for the whole population of the study. The author then introduced an artificial imbalance between control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units, or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up quite unbalanced, especially for learning to generalize the effects of the treatments t = 0 and t = 1.

Along with the covariates, each unit carries the simulated causal information: the effectively applied treatment (t = 0 or t = 1), the observed outcome (yF), the counterfactual outcome (yCF), and the noiseless average outcomes mu0 and mu1.

In this dissertation 100 and 1000 replications of the original (Hill 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie 2016). The 100 and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson and Sontag 2017; Louizos et al. 2017), the state-of-the-art baselines chosen for comparison in the experiments of the present work.
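For readers who download the same files, a loading sketch may help. It assumes the layout of the commonly distributed replication archives (NumPy `.npz` files with keys `x`, `t`, `yf`, `ycf`, `mu0`, `mu1`, with the last axis indexing the replication); these key names are an assumption here, not something prescribed by the thesis, so adapt them if a particular download differs.

```python
import numpy as np


def load_ihdp_replication(path, i):
    """Return the arrays of replication i from an IHDP .npz archive.

    Assumed keys: 'x', 't', 'yf', 'ycf', 'mu0', 'mu1' (last axis = replication).
    """
    d = np.load(path)
    x = d["x"][:, :, i]                        # (units, 25) covariates
    t = d["t"][:, i]                           # applied treatment, 0 or 1
    yf, ycf = d["yf"][:, i], d["ycf"][:, i]    # factual / counterfactual outcomes
    mu0, mu1 = d["mu0"][:, i], d["mu1"][:, i]  # noiseless average outcomes
    return x, t, yf, ycf, mu0, mu1
```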

This dataset is nowadays a strong benchmark for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE and PEHE errors, for the reader to consult if interested. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit and Sontag 2016b) the IHDP dataset (Hill 2011) was run on 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie 2016) with the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al. 2017; Shalit, Johansson and Sontag 2017). For the BART results, (Johansson, Shalit and Sontag 2016b) relied on Bayesian Additive Regression Trees (Chipman, George and McCulloch 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill, 2011). However, the authors do not make explicit the number of replications used to gather their metrics, nor whether log-linear response surface A, B or any other method was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with the results of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset was performed to predict the factual (yF) and counterfactual (yCF) outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. In this work it is of particular interest to correctly predict the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as should be clear by this point, neither the counterfactual outcome yCF nor the average potential outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these values, along with the factual outcome yF, are only used to compute the εITE, εATE and εPEHE errors.
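As a sketch (not the exact evaluation code of (Louizos et al., 2017)), two of these error measures can be computed from the predicted per-unit effects and the held-out mu0, mu1 as follows:

```python
import numpy as np

def evaluation_errors(ite_hat, mu0, mu1):
    """eps_ATE and eps_PEHE of predicted unit-level effects vs. tau = mu1 - mu0."""
    tau = mu1 - mu0
    eps_ate = np.abs(np.mean(ite_hat) - np.mean(tau))   # error on the average effect
    eps_pehe = np.sqrt(np.mean((ite_hat - tau) ** 2))   # root mean squared per-unit error
    return float(eps_ate), float(eps_pehe)

mu0 = np.array([1.0, 1.0, 1.0])
mu1 = np.array([2.0, 3.0, 4.0])                             # true unit effects: 1, 2, 3
print(evaluation_errors(np.array([1.0, 2.0, 3.0]), mu0, mu1))  # (0.0, 0.0)
print(evaluation_errors(np.array([2.0, 3.0, 4.0]), mu0, mu1))  # (1.0, 1.0)
```

Perfect per-unit predictions give zero error on both metrics; a prediction uniformly shifted by one unit costs exactly one unit of εATE and of √εPEHE.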

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The 10, 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same ones used to produce the results in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017), whose tables of state-of-the-art errors are also displayed in this section so that the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions used are those given by (Shalit, Johansson and Sontag, 2017), the same ones later followed by (Louizos et al., 2017) to perform, compare and present their results.

Within-sample: this refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (with different numbers of samples in which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data and were mentioned in the previous chapters.


Out-of-sample: these predictions are made on completely unseen units, never used in the training or validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, and the ITE, ATE and PEHE errors are then computed.

Once the model is trained, it predicts an outcome for each one of the inputs (units) using treatment value t = 0; subsequently, predictions are made setting the treatment to t = 1 for all units. The difference between these two predictions for each input is the estimated ITE, which ultimately defines whether a patient would benefit from the treatment. Mathematically it is represented by E[Y1 − Y0 | x].

The machine learning algorithms, implemented in Python with scikit-learn (User guide: contents — scikit-learn 0.19.2 documentation), were run with their default hyperparameters to obtain the above-mentioned metrics, shown in the tables displayed in this chapter. Hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
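The procedure just described can be sketched with any scikit-learn regressor by appending the treatment indicator as an extra input column. The synthetic outcome below, with a constant true effect of 2, is an illustrative assumption, not the IHDP data:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
t = rng.integers(0, 2, size=n)
y = X[:, 0] + 2.0 * t + 0.1 * rng.normal(size=n)  # true ITE is 2 for every unit

# Fit one regressor on [covariates, treatment] ...
model = SVR(kernel="rbf", C=1e3, gamma=0.01)
model.fit(np.column_stack([X, t]), y)

# ... then predict both potential outcomes for every unit and subtract:
y1 = model.predict(np.column_stack([X, np.ones(n)]))
y0 = model.predict(np.column_stack([X, np.zeros(n)]))
ite_hat = y1 - y0  # per-unit estimate of E[Y1 - Y0 | x]
print(float(ite_hat.mean()))  # close to the true effect of 2
```

A positive estimated ITE for a unit indicates the treatment is predicted to benefit that unit.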

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower, the better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show results for 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is remarkable that the split between training and testing was performed randomly, over the training dataset only. The intention was to test whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained in the following experiments, with the datasets already split into train and test as downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson and Sontag, 2017). As mentioned in this thesis, hyperparameter tuning, if any, was performed with this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with log-linear response setting A generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four different tables are presented for the 1000 replications, in pairs: two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling each feature to [0, 1] using the MinMaxScaler from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results of the methods presented in the following section.
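The scaling step can be reproduced with a pipeline, so that the same [0, 1] transform learned on the training data is applied at prediction time. This is a sketch on synthetic data; the SVR settings mirror the ones used above but are not the tuned values:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6)) * np.array([1.0, 10.0, 100.0, 1.0, 1.0, 1.0])  # mixed scales
y = X[:, 0] + 0.01 * X[:, 2]

# MinMaxScaler maps every feature to [0, 1]; chaining it in a pipeline keeps
# train/test scaling consistent and avoids leaking test-set statistics.
model = make_pipeline(MinMaxScaler(), SVR(kernel="rbf", C=1e3, gamma=0.01))
model.fit(X, y)

scaled = model.named_steps["minmaxscaler"].transform(X)
print(float(scaled.min()), float(scaled.max()))  # ~0.0 and ~1.0 on the training data
```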



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Next, logistic regression with a multinomial multi-class setting was applied. Its performance is well below that of the regressors, the main reason being that the continuous target values must be encoded into discrete classes before fitting, and precision is lost when decoding the predictions back to numeric values. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

Method                                  εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)     7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)         7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                                  εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)     5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)         5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


From all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyperparameter tuning were done. The errors observed were even smaller, so the final hyperparameters selected for this dataset were: Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample on 100 replications of the dataset; this is the same methodology that the authors (Shalit, Johansson and Sontag, 2017; Johansson, Shalit and Sontag, 2016a; Louizos et al., 2017) state they use


for their own hyperparameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyperparameter search, including the final selection.

TABLE 4.13: IHDP 100 replications, SVR hyperparameter tuning - Within-sample

Configuration                 εITE           εATE           √εPEHE
SVR rbf C=1e3 gamma=0.1       3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR rbf C=1e3 gamma=0.05      2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR rbf C=1e3 gamma=0.01      2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR rbf C=1e3 gamma=0.001     3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR rbf C=1e3 gamma=0.0001    4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR rbf C=1e3 gamma=0.00001   4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR rbf C=1e10 gamma=0.1      3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR rbf C=1e20 gamma=0.1      3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR rbf C=1e30 gamma=0.1      3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR poly C=1e3 degree=2       2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR poly C=1e3 degree=1       2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR poly C=1e3 degree=4       2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR poly C=1e10 degree=2      2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyperparameter tuning - Out-of-sample

Configuration                 εITE           εATE           √εPEHE
SVR rbf C=1e3 gamma=0.1       2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR rbf C=1e3 gamma=0.05      2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR rbf C=1e3 gamma=0.01      2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR rbf C=1e3 gamma=0.001     3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR rbf C=1e3 gamma=0.0001    4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR rbf C=1e3 gamma=0.00001   4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR rbf C=1e10 gamma=0.1      2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR rbf C=1e20 gamma=0.1      2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR rbf C=1e30 gamma=0.1      2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR poly C=1e3 degree=2       2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR poly C=1e3 degree=1       2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR poly C=1e3 degree=4       2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR poly C=1e10 degree=2      3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39
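This kind of hyperparameter search can be reproduced with scikit-learn's GridSearchCV. The sketch below uses synthetic data; the gamma grid mirrors the values tried above, but which value wins depends on the data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 6))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=150)

# Exhaustively cross-validate every (C, gamma) pair on the rbf kernel.
param_grid = {"C": [1e3], "gamma": [0.1, 0.05, 0.01, 0.001]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_)  # best (C, gamma) pair by cross-validated MSE
```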

Finally, the best results obtained by the experiments run in this thesis are those displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results reported in (Shalit, Johansson and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)

Method        √εPEHE       εATE
OLS/LR-1      5.8 ± .3     .73 ± .04
OLS/LR-2      2.4 ± .1     .14 ± .01
BLR           5.8 ± .3     .72 ± .04
k-NN          2.1 ± .1     .14 ± .01
TMLE          5.0 ± .2     .30 ± .01
BART          2.1 ± .1     .23 ± .01
RANDFOR       4.2 ± .2     .73 ± .05
CAUSFOR       3.8 ± .2     .18 ± .01
BNN           2.2 ± .1     .37 ± .03
TARNET        .88 ± .0     .26 ± .01
CFR MMD       .73 ± .0     .30 ± .01
CFR WASS      .71 ± .0     .25 ± .01

Within-sample, IHDP 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)

Method        √εPEHE       εATE
OLS/LR-1      5.8 ± .3     .94 ± .06
OLS/LR-2      2.5 ± .1     .31 ± .02
BLR           5.8 ± .3     .93 ± .05
k-NN          4.1 ± .2     .79 ± .05
BART          2.3 ± .1     .34 ± .02
RANDFOR       6.6 ± .3     .96 ± .06
CAUSFOR       3.8 ± .2     .40 ± .03
BNN           2.1 ± .1     .42 ± .03
TARNET        .95 ± .0     .28 ± .01
CFR MMD       .78 ± .0     .31 ± .01
CFR WASS      .76 ± .0     .27 ± .01

Out-of-sample, IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section discussed publications implementing powerful feature selection methods, an experiment was also performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn library.

Strictly speaking, under the Strong Ignorability assumption made for the dataset studied, it would not be appropriate to discard covariates; however, given the sensitivity of the machine learning regressors to highly correlated input features, feature elimination might reduce some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; thus they are not shown, but they can be reproduced from the code implementation for further analysis.
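For reference, the experiment can be reproduced along these lines (a minimal sketch with synthetic data; the estimator and the number of features kept are illustrative choices, not the thesis configuration):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# 10 covariates, only 3 informative: RFE repeatedly refits the estimator and
# drops the feature with the smallest coefficient until 3 remain.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)
selector = RFE(LinearRegression(), n_features_to_select=3)
selector.fit(X, y)
print(int(selector.support_.sum()))  # 3 features kept
```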

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network (DANN) implementation was tested on just 10 replications of the IHDP dataset. The code from the GitHub repository of Dr Spyros Samothrakis was executed to obtain the 10-replication results reported in this work; the code uploaded with this dissertation contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days with an Intel dual-core i7.

Domain Adaptation algorithms are a promising avenue for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                  εITE           εATE           √εPEHE
DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and out-of-sample, IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the results of the machine learning regressors applied in this dissertation are very close to those obtained in the publications of the compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the unbalanced treatment assignment) nor any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.


The compared approaches appear to demand an excessive amount of effort from their authors, leading to complicated methods that do not gain much in causal prediction from observational data.

However, those authors claim that with more unbalanced representations of the feature space or of the treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Networks on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state-of-the-art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is part of the standard workflow when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, although its emphasis shifted once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal inference problems from observational data, with more than two possible treatments, that are substantially more suitable for Reinforcement Learning algorithms than for any deep neural network or regressor.

Finally, this work is intended to cover a largely empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer science or data science background when relating terms from the causal inference field. I did my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include several approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to a multi-valued treatment. To the best of my knowledge it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold deciding whether a treatment should be applied or not, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it showed in the experiment run here.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied over time is framed within continuous-space time series problems. Such datasets will possibly be the next focus of researchers working on machine learning applied to treatments administered over time.


Bibliography

Alaa Ahmed M Michael Weisz and Mihaela Van Der Schaar (2017) Deep Counter-factual Networks with Propensity-Dropout Tech rep arXiv arXiv170605966v1URL httpsarxivorgpdf170605966pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf

32 Bibliography

Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie Vincent (2016) NPCI Non-parametrics for Causal Inference URL https githubcomvdorienpci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer Patrik O et al Nonlinear causal discovery with additive noise models Tech repURL https is tuebingen mpg de fileadmin user _ upload files publicationsNIPS2008-Hoyer-neu_5406[0]pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww

Bibliography 33

Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.nature.com/articles/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1.

Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (Complete Draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.


8 Chapter 2 Background

Causality from observational data is clearly applicable to a wide range of industries, e.g. advertisement placement selection, health care systems, finance, or even improving education (Recursive Partitioning for Heterogeneous Causal Effects; Hoiles and Van Der Schaar 2016; Bottou et al. 2013). In particular, counterfactual inference in observational studies has been a topic of interest in economics, statistics, health care, pharmaceutical companies, epidemiology and sociology (Causal Inference Using Potential Outcomes: Design, Modeling, Decisions; Morgan and Winship 2014; Chernozhukov et al. 2016), whereas in machine learning it started attracting attention no less than a decade ago (Lang 1995; Bottou et al. 2013; Swaminathan and Joachims 2015a). Much of the work in machine learning has targeted discovering the underlying causal graph from collected data (Nonlinear causal discovery with additive noise models; Maathuis et al. 2010; Triantafillou and Tsamardinos 2015; Mooij et al. 2016).

Causal inference methods for counterfactual prediction are usually grouped into parametric, non-parametric and doubly robust methods.

Parametric methods model the relationships between feature-action pairs and rewards through one or more parameters, trying to explicitly specify the relations within contexts, outcomes and actions (treatments). Among these methods, linear and logistic regression (Prentice 1976; Gelman and Hill 2007), random forests (Wager and Athey 2015) and regression trees (Chipman, George and McCulloch 2010) have been used in the past to complete the task. For example, (Wager and Athey 2017) estimate ITEs with Causal Forests, but their asymptotic estimates on datasets with a large number of relevant features have limitations that need to be addressed in future work.

In non-parametric approaches, the counterfactual predictions are mostly calculated through propensity score matching and re-weighting (Joachims and Swaminathan 2016; Austin 2011; Rosenbaum and Rubin 1983; Rosenbaum 2002).

Doubly robust methods are known for merging the characteristics of the parametric and non-parametric families (Dudik, Langford and Li 2011; Jiang and Li 2015). A common example of this approach is propensity score weighted regression (Bang and Robins 2005; Dudik, Langford and Li 2011). When the treatment assignment probability is known, these methods model the problem particularly well, e.g. in off-policy evaluation or learning from bandits. However, in most observational data settings their efficiency drops dramatically (Kang and Schafer 2007).

Machine learning for predicting Individual Treatment Effects has attracted a lot of interest during the last two years, through the development of custom metric functions - as well as the application of other techniques to causality - with special focus on datasets with unbalanced treatment assignment. This refers to the sub-area of causality known as causal inference from observational data; observational data is data that has been, or is being, collected without the possibility of designing and running a proper Randomized Controlled Trial. The creation of custom distance learning metrics and custom loss functions applied to Neural Networks has brought interesting advances to the scientific community (Shalit, Johansson and Sontag 2017; "Learning Representations for Counterfactual Inference"). (Tian et al. 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically for the estimation of the Individual Treatment Effect, (Johansson, Shalit and Sontag 2016a; Shalit, Johansson and Sontag 2017; Alaa, Weisz and Van Der Schaar 2017) have made important contributions, whereas for Policy Optimization the work of (Swaminathan and Joachims 2015a; Swaminathan and Joachims 2015b) can be consulted. In Policy Optimization the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few, (Wager and Athey 2015), (Athey and Imbens 2016), ("Learning Representations for Counterfactual Inference"), (Shalit and Sontag 2016), (Shalit, Johansson and Sontag 2017) and (Johansson, Shalit and Sontag 2016a) worked on learning balanced representations, using Neural Networks both to learn better predictions of the factual outcome and to minimize the error between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson and Sontag 2017) the authors built on (Johansson, Shalit and Sontag 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. In ITE prediction, other work was performed using Gaussian processes (Alaa, Weisz and Van Der Schaar 2017) and decision trees in different approaches (Hill 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey 2015).

Similarly, (Atan et al. 2016) face the problem of learning from biased data with many features by performing feature selection while predicting among multiple possible actions (outcomes), which is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning features that are relevant for predicting some actions while not taking them into account for others. The relevant feature selection learning was done by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), likewise in (Tekin and Van Der Schaar 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and did not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims 2015a) came up with a Counterfactual Risk Minimization (CRM) method in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame and Van Der Schaar 2018) proposed to address the selection bias by learning representations, work closely related to the domain adaptation bounds in (Ben-David et al. 2007; Blitzer, McDonald and Pereira 2006). Additional techniques on policy optimization were developed by (Beygelzimer and Langford 2008), in which the propensity scores need to be known, solving the selection bias through rejection sampling. The algorithm that (Atan, Zame and Van Der Schaar 2018) introduce is based on domain adaptation (DA), as in (Gan et al. 2016). More work in the DA field was done by (Zhang et al. 2013; Daumé 2009).


To conclude, in cause and effect analysis, Time Series data is widely adopted for decision making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al. 2012), algorithms can learn rules to make decisions along time-steps. An estimator based on structural nested models was introduced by (Lok 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time series causality task. Later on, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that is specified beforehand (Dudik, Langford and Li 2011; Swaminathan and Joachims 2015a; Jiang and Li 2015; Paduraru et al. 2012; Doroudi, Thomas and Brunskill 2017).

2.1 Machine Learning

In this section the machine learning techniques applied to perform the experiments, using the scikit-learn open source framework, are described.

The vast majority of the methods tested belong to the family of Generalized Linear Models, in which the target (label) value is represented as a linear combination of the covariates (inputs):

$$\hat{y}(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p \qquad (2.1)$$

where the vector $w = (w_1, \dots, w_p)$ represents the coefficients and $w_0$ is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between theobserved dataset and the predictions made on it

Mathematically, it solves a problem of the form:

$$\min_w \|Xw - y\|_2^2$$

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the model produces high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a previously shaped experimental design.
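As a minimal sketch (not the thesis code, and using toy data rather than IHDP), fitting an OLS model with scikit-learn and reading back the intercept $w_0$ and coefficient $w_1$:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 1 + 2x exactly, so OLS recovers w0 = 1, w1 = 2
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression().fit(X, y)
w0, w1 = model.intercept_, model.coef_[0]
```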


2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. The loss turns into a penalized sum-of-squares minimization problem:

$$\min_w \|Xw - y\|_2^2 + \alpha \|w\|_2^2$$

It is worth mentioning that the parameter $\alpha \geq 0$ controls the amount of shrinkage, and therefore the robustness to collinearity that the trained model is going to have.
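The effect of $\alpha$ can be illustrated with a small sketch (toy data, not the thesis experiments): a larger penalty shrinks the fitted coefficient more strongly towards zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # true slope is 2

# alpha controls the penalty strength: larger alpha => stronger shrinkage
weak = Ridge(alpha=0.01).fit(X, y)
strong = Ridge(alpha=100.0).fit(X, y)
```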

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely spread Support Vector Machines for classification to regression problems. During the training phase, the best possible solution is the one that is least penalized in total by the loss function. The support vectors will be the inputs that are either misclassified, classified within the margin, or lying on the edges of the generated hyper-plane that splits the dataset for future predictions.

In particular, an SVR takes the training vectors $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, and a vector $y \in \mathbb{R}^n$; $\varepsilon$-SVR solves the following primal problem:

$$\min_{w, b, \zeta, \zeta^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)$$

where $e$ is the vector of all ones, $C > 0$ is the upper bound, and $Q$ is an $n \times n$ positive semidefinite matrix with $Q_{ij} \equiv K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.

The decision function is:

$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + \rho$$
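A small usage sketch (illustrative only, not the thesis configuration; the linear kernel and hyperparameter values are assumptions made for the example):

```python
import numpy as np
from sklearn.svm import SVR

# 20 points on the line y = 2x + 1
X = np.linspace(0.0, 3.0, 20).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# epsilon defines the insensitive tube around the regression function;
# a linear kernel is chosen here purely for illustration.
model = SVR(kernel="linear", C=10.0, epsilon=0.1).fit(X, y)
pred = model.predict([[1.5]])[0]  # the true value at x = 1.5 is 4.0
```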

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than plain Linear Regression.

This technique builds a probabilistic model of the regression problem, with the prior over the parameter $w$ of the general Bayesian Regression solver given by a spherical Gaussian:

$$p(w \mid \lambda) = \mathcal{N}(w \mid 0, \lambda^{-1} I_p)$$

The scikit-learn defaults are used to train the model: $\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 = 10^{-6}$.

During the model fitting process, the parameters $w$, $\alpha$ and $\lambda$ are estimated jointly.
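A minimal sketch of this (toy data, not the thesis experiments): besides a point prediction, the fitted posterior also yields an uncertainty estimate per prediction.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # y = 1 + 2x

# Defaults match the text: alpha_1 = alpha_2 = lambda_1 = lambda_2 = 1e-6;
# w, alpha and lambda are estimated jointly during fit().
model = BayesianRidge().fit(X, y)
mean, std = model.predict(np.array([[1.5]]), return_std=True)
```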


2.1.5 Lasso

Lasso Regression is a linear model fitted with an $\ell_1$ prior as regularizer. Its objective is to minimize:

$$\min_w \frac{1}{2 n_{\text{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1$$

This method minimizes the least-squares penalty with $\alpha \|w\|_1$ added, where $\alpha$ is a constant and $\|w\|_1$ is the $\ell_1$-norm of the coefficient vector.

It is important to notice that this algorithm produces sparse models, which may be helpful to perform feature selection.
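The sparsity property can be seen in a short sketch (synthetic data, not from the thesis): when only one covariate actually drives the target, the $\ell_1$ penalty pushes the remaining coefficients to (near) zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3.0 * X[:, 0]  # only the first of the five covariates is relevant

# The l1 penalty zeroes out irrelevant coefficients, enabling feature selection
model = Lasso(alpha=0.5).fit(X, y)
```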

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with $\ell_1$ regularization applied.

The objective function is:

$$\frac{1}{2 n_{\text{samples}}} \|y - Xw\|_2^2 + \alpha \|w\|_1$$

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights $w$.

It drops the assumption of the Gaussian prior being spherical, allowing it to be elliptical. Mathematically:

$$p(w \mid \lambda) = \mathcal{N}(w \mid 0, A^{-1})$$

with $\text{diag}(A) = \lambda = \{\lambda_1, \dots, \lambda_p\}$.

2.1.8 Passive Aggressive Regressor

Passive Aggressive algorithms are suitable for large-scale learning; they do not require a learning rate, but they do require a regularization parameter $C$.

They can be used with two different loss functions: PA-I (epsilon insensitive) or PA-II (squared epsilon insensitive).

2.1.9 Theil Sen Regressor

This method is especially suited to handle multivariate outliers, but its efficiency decreases dramatically in high-dimensional problems; in that regime it becomes similar to Linear Regression with Ordinary Least Squares.
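The robustness to outliers can be demonstrated with a short sketch (synthetic data, not from the thesis): after corrupting a few points, Theil-Sen recovers a slope closer to the true one than OLS does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, TheilSenRegressor

X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0  # true slope is 2
y[:3] = 100.0              # corrupt three points with gross outliers

ols = LinearRegression().fit(X, y)
ts = TheilSenRegressor(random_state=0).fit(X, y)
# Theil-Sen's slope estimate is much less distorted by the outliers
```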


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the $n$ nearest neighbors seen during the training phase. It is important to notice that $n$ is defined by the user, and it will positively or negatively affect the obtained prediction results.
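A minimal sketch of the neighbor-averaging behaviour (toy data, not the thesis experiments):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

# With n_neighbors=2, the prediction at x = 1.4 averages the targets of its
# two nearest training points (x = 1 and x = 2): (1 + 2) / 2 = 1.5
model = KNeighborsRegressor(n_neighbors=2).fit(X, y)
pred = model.predict([[1.4]])[0]
```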

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, modeling class probabilities through the logistic function, and it can also handle predictions for more than two classes.

This scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets and the results willbe discussed in the Experiments section
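As an illustrative sketch (toy data; the lbfgs solver with L2 regularization is just one of the solver/regularization combinations referred to above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# L2-regularized logistic regression fitted with the lbfgs solver
clf = LogisticRegression(penalty="l2", solver="lbfgs").fit(X, y)
low = clf.predict(np.array([[0.0]]))[0]
high = clf.predict(np.array([[3.0]]))[0]
```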


Chapter 3

Methodology

This chapter details what this work set out to achieve. In addition, the methods that were used are explained, along with any other information needed to help the reader follow the experiments covered later.

Furthermore, the dataset that was used is presented, closing with a section about other possible datasets that could be applied and possible limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in full detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project - and to some extent of the latest efforts in machine learning applied to causality - is to make accurate predictions on a set of units, patients or inputs (in the Machine Learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Because of its nature, the already collected (observational) data has neither been randomized properly nor drawn from a single probability distribution. Also, the number of units which received the treatment versus the number which did not could potentially differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they would be unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or the pedestrians, tested against a control treatment - which in this case would be driving without alcohol consumption - is completely unethical for clear reasons.

To overcome these limitations when working on causal effects from observational data, synthetic, semi-synthetic or toy datasets are created by researchers in order to establish a good starting point and benchmark framework to try, test or develop better algorithms that are able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions for causal inference. The first, and most common, is the one in which the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases a policy risk function $\pi$ is designed to apply, or not, the treatment $t$ depending on a certain threshold $\theta$. The main goal when iterating over different values of the threshold for the trained dataset is to make as few errors as possible when deciding between the active treatment and the control.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received visits to their homes and integration at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

(Hill 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross 1993) mentioned in the paragraph above. In (Hill 2011), some continuous and binary covariates from this real-life RCT were selected; making use of these covariates, the author created simulated, non-parametrically generated outcomes for the whole population of the trial. In total, 25 covariates of the whole study were taken for this dataset creation. The author then introduced an artificial imbalance between control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of treatment t = 0 or t = 1 based on the generalization task that an algorithm can perform.

Along with the covariates, for each unit the simulated causal information can be observed: the treatment effectively applied (t = 0 or t = 1), the observed outcome (yF), the counterfactual outcome (yCF), and the average outcomes with noise, mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie 2016). The 100 and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson and Sontag 2017; Louizos et al. 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analysing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles metrics

It is worth mentioning other published articles evaluating ITE, ATE and PEHE errors, should the reader wish to look further into them. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit and Sontag 2016b), the IHDP dataset (Hill 2011) was run on 100 replication experiments to perform hyperparameter tuning, and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. These results are not shown in this dissertation since the response surface chosen differs from the state-of-the-art results and papers published in the following years (Louizos et al. 2017; Shalit, Johansson and Sontag 2017). In (Johansson, Shalit and Sontag 2016b), the BART results were based on Bayesian Additive Regression Trees (Chipman, George and McCulloch 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

In a recent publication from the Proceedings of the 10th International Conference on Educational Data Mining, (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) referenced and ran experiments on the IHDP dataset (Hill 2011). However, the authors neither make explicit the number of replications used to gather the metrics, nor state whether log-linear response surface "A" or "B" or any other method was used to simulate the semi-synthetic dataset. Consequently, the results obtained by them cannot be compared with this dissertation's results and are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset were performed to ultimately predict all the factual yF and counterfactual yCF outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al. 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which amounts to identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome yCF nor the average outcomes with noise, mu0 and mu1, can be used at all to train the regressor models. Instead, these three values, along with the factual outcome yF, are only used to compute the εITE, εATE and εPEHE errors.
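For reference, two of these errors are commonly defined as follows (a sketch of the standard definitions used in the IHDP literature, e.g. in (Hill 2011; Shalit, Johansson and Sontag 2017), with $\hat{y}_1(x_i)$ and $\hat{y}_0(x_i)$ denoting the model's predicted outcomes under treatment and control):

```latex
\epsilon_{ATE} = \left| \frac{1}{n} \sum_{i=1}^{n} \big( \mu_1(x_i) - \mu_0(x_i) \big)
               - \frac{1}{n} \sum_{i=1}^{n} \big( \hat{y}_1(x_i) - \hat{y}_0(x_i) \big) \right|

\epsilon_{PEHE} = \frac{1}{n} \sum_{i=1}^{n}
                \Big( \big( \mu_1(x_i) - \mu_0(x_i) \big)
                    - \big( \hat{y}_1(x_i) - \hat{y}_0(x_i) \big) \Big)^2
```

The tables in this chapter report $\sqrt{\epsilon_{PEHE}}$.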

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The 10, 100 and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson and Sontag 2017; Louizos et al. 2017), whose tables with state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson and Sontag 2017), the same convention later followed by (Louizos et al. 2017) to perform, compare and show their results.

Within-sample: this refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already-trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (with different numbers of samples to which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data, and they were mentioned in the Background chapter.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even more from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0, and then predictions are made again setting all the values of the treatment to t = 1. The difference between these two predictions for each input is known as the ITE, and will ultimately define whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by $E[Y_1 - Y_0 \mid x]$.

The machine learning algorithms, implemented in Python code via (User guide: contents — scikit-learn 0.19.2 documentation), were run with the default hyperparameters to obtain the above-mentioned metrics, which are finally shown in the tables displayed in this chapter. The hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
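The predict-twice procedure just described can be sketched as follows (an illustrative simulation, not the thesis code; the data-generating process and variable names are assumptions made for the example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n = 200
X = rng.randn(n, 3)                          # covariates
t = rng.binomial(1, 0.5, size=n)             # observed treatment assignment
y = X[:, 0] + 2.0 * t + 0.01 * rng.randn(n)  # factual outcome; true ITE is 2

# Train only on (covariates, factual treatment) -> factual outcome
model = LinearRegression().fit(np.column_stack([X, t]), y)

# Predict twice per unit: once with t = 1 and once with t = 0
y1 = model.predict(np.column_stack([X, np.ones(n)]))
y0 = model.predict(np.column_stack([X, np.zeros(n)]))
ite_hat = y1 - y0  # per-unit estimate of E[Y1 - Y0 | x]
```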

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is remarkable that the split between training and testing was performed randomly, over the training dataset only. The intention was to test whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which accounts for the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, the hyperparameter tuning, if any, was performed on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both the within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting "A" generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four different tables are shown for the 1000 replications, in pairs: two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results of the methods presented in the following section.
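The scaling step can be sketched as follows (a minimal example on synthetic data; the scaler is fit on the training covariates only, so that no information from the held-out units leaks into training):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
X_train = rng.normal(loc=5.0, scale=3.0, size=(200, 25))  # toy covariates
X_test = rng.normal(loc=5.0, scale=3.0, size=(50, 25))

# Fit on the training covariates, then apply the same transform to the
# held-out units. Test values may fall slightly outside [0, 1], since the
# scaler only guarantees the range on the data it was fit on.
scaler = MinMaxScaler()  # maps each training feature to the [0, 1] range
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
```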



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Next, Logistic Regression with a multinomial multi-class setting was applied. Its performance is well below the regressors; the main reason is that the continuous target values must be encoded into discrete classes to assign them probabilities, and these classes are not the same values that need to be predicted. Also, when decoding the predictions, precision is lost. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
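The encode/decode procedure described above can be sketched as follows (the decile binning and bin-mean decoding are illustrative assumptions; the thesis does not detail its exact encoding):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=300)  # continuous outcome

# Encode: discretize the continuous target into decile bins so that a
# multinomial classifier can be trained on it. This binning is exactly
# where precision is lost.
bins = np.quantile(y, np.linspace(0, 1, 11)[1:-1])
y_cls = np.digitize(y, bins)  # class labels 0..9

# newton-cg solves the multinomial problem with the l2 penalty (in older
# scikit-learn versions this required multi_class="multinomial").
clf = LogisticRegression(solver="newton-cg", max_iter=1000)
clf.fit(X, y_cls)

# Decode: map each predicted class back to a representative value,
# here the mean of the training targets that fell into that bin.
centers = np.array([y[y_cls == c].mean() for c in range(10)])
y_hat = centers[clf.predict(X)]
```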

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

Method                                  εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)     7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)         7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                                  εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)     5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)         5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


From all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyperparameter tuning were done. The errors observed were even smaller, so the final hyperparameters selected for this dataset were the Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample, but over 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyperparameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyperparameter selection, with the final results shown.
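A minimal sketch of such a sweep over the rbf-kernel configurations (synthetic data; the thesis ranked configurations by the ITE, ATE, and PEHE errors rather than by this training-MSE proxy):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 25))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=150)

# Sweep the rbf-kernel gamma values at C=1e3; the winning configuration
# reported in the thesis was C=1e3, gamma=0.01.
best = None
for gamma in [0.1, 0.05, 0.01, 0.001]:
    model = SVR(kernel="rbf", C=1e3, gamma=gamma).fit(X, y)
    mse = np.mean((model.predict(X) - y) ** 2)  # proxy score for the sketch
    if best is None or mse < best[1]:
        best = (gamma, mse)
```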

TABLE 4.13: IHDP 100 replications, SVR hyperparameter tuning - Within-sample

Method                    εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyperparameter tuning - Out-of-sample

Method                    εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39

Finally, the results obtained by the experiments run in this thesis are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method       √εPEHE      εATE
OLS/LR-1     5.8 ± .3    .73 ± .04
OLS/LR-2     2.4 ± .1    .14 ± .01
BLR          5.8 ± .3    .72 ± .04
k-NN         2.1 ± .1    .14 ± .01
TMLE         5.0 ± .2    .30 ± .01
BART         2.1 ± .1    .23 ± .01
RANDFOR      4.2 ± .2    .73 ± .05
CAUSFOR      3.8 ± .2    .18 ± .01
BNN          2.2 ± .1    .37 ± .03
TARNET       .88 ± .0    .26 ± .01
CFR MMD      .73 ± .0    .30 ± .01
CFR WASS     .71 ± .0    .25 ± .01

Within-sample, IHDP 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method       √εPEHE      εATE
OLS/LR-1     5.8 ± .3    .94 ± .06
OLS/LR-2     2.5 ± .1    .31 ± .02
BLR          5.8 ± .3    .93 ± .05
k-NN         4.1 ± .2    .79 ± .05
BART         2.3 ± .1    .34 ± .02
RANDFOR      6.6 ± .3    .96 ± .06
CAUSFOR      3.8 ± .2    .40 ± .03
BNN          2.1 ± .1    .42 ± .03
TARNET       .95 ± .0    .28 ± .01
CFR MMD      .78 ± .0    .31 ± .01
CFR WASS     .76 ± .0    .27 ± .01

Out-of-sample, IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section presented publications implementing powerful feature selection methods, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn library.

In addition, assuming Strong Ignorability on the studied dataset, it would not be appropriate to perform such an experiment; but given the nature of the machine learning regressors and their sensitivity to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; thus they are not shown, but they can be reviewed by the reader in the code implementation for further analysis.
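A sketch of the RFE experiment (synthetic data; the base estimator and the number of features to keep are illustrative assumptions, not the thesis configuration):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 25))
# Only the first two covariates actually drive the outcome.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Recursively drop the weakest features (by coefficient magnitude) until
# roughly half remain, then transform the design matrix.
selector = RFE(LinearRegression(), n_features_to_select=12).fit(X, y)
X_reduced = selector.transform(X)
```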

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for the 10 replications in this work; the code uploaded contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsection, the code uploaded is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated finishing time was about four and a half days with an Intel Dual Core i7.

Domain Adaptation algorithms are a promising avenue for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                  εITE           εATE           √εPEHE
DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and Out-of-sample, IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the results of the machine learning regression algorithms applied in this dissertation are very close to those obtained in the work published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treated dataset, nor any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.


It seems that an excessive amount of effort from the authors leads to complicated methods without gaining much in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the 10-replication Domain Adaptation Neural Network training and testing errors showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that represent the state-of-the-art results.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is part of the standard tasks when using machine learning algorithms) have been performed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows the results of machine learning techniques not applied before to the adopted benchmark IHDP dataset, performing predictions on both the factual and counterfactual outcomes to later present the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, but it changed when the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in the last few years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causality problems from observational data, with more than two possible outcomes, that are substantially more suitable to solve with Reinforcement Learning algorithms than with any other Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply the studied machine learning methods to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous output could be extended to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied or not, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing with this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due to both the architectural design of the algorithm and the state-of-the-art precision it showed for the experiments run.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the applied treatments is framed within time-series problems in the continuous space. These kinds of datasets will possibly be the next focus of researchers in machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353-60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113.

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399-424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786.

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x.

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06). Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094.

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207-3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266-298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". pp. 849-866. DOI: 10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217-240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML '16, pp. 1596-1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). Pisa, Italy: ACM, pp. 1199-1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803.

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf
- (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf
- (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523-539. ISSN: 0883-4237. DOI: 10.1214/07-STS227.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331-339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464-1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247-248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1-102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478-494. ISSN: 1939-1463. DOI: 10.1037/a0029373.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89-101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393-1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6.

Rosenbaum, Paul R. (2002). "Observational Studies". pp. 1-17. DOI: 10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41-55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34-58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322-331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (complete draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html
- (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517-1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147-2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342
- (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.

• Declaration of Authorship
• Abstract
• Acknowledgements
• Introduction
  • Motivation
  • Purpose and Research Question
  • Approach and Methodology
  • Scope and Limitation
• Background
  • Rubin-Neyman Causal Model
    • The fundamental problem of causal analysis
    • Metrics for Causality
    • Assumptions
    • Definitions
  • Related Work
  • Machine Learning
    • Ordinary Least Squares (Linear Regression)
    • Ridge Regression
    • Support Vector Regressor
    • Bayesian Ridge
    • Lasso
    • Lasso Lars
    • ARD Regression
    • Passive Aggressive Regressor
    • Theil Sen Regressor
    • K-Neighbors Regressor
    • Logistic Regression
• Methodology
  • Dataset
  • IHDP dataset
  • Other articles' metrics
• Experiments
  • Machine learning methods applied to IHDP dataset
  • Other experiments
    • Recursive Feature Elimination
    • Domain Adaptation Neural Networks
• Discussion
• Conclusions
  • Concluding Remarks
  • Future work
• Bibliography

Chapter 2: Background

advances to the scientific community (Shalit, Johansson, and Sontag, 2017; "Learning Representations for Counterfactual Inference"). (Tian et al., 2014) modeled interactions between the treatment and the inputs (covariates), creating a relatively balanced method. Specifically, for the estimation of the Individual Treatment Effect, (Johansson, Shalit, and Sontag, 2016a; Shalit, Johansson, and Sontag, 2017; Alaa, Weisz, and Van Der Schaar, 2017) made important contributions, whereas for Policy Optimization (Swaminathan and Joachims, 2015a; Swaminathan and Joachims, 2015b) can be consulted. In Policy Optimization, the goal is to find a policy (threshold) that maximizes the factual outcome or, in other words, that minimizes the risk of predicting the wrong action.

Adopting machine learning methods to estimate the individual treatment effect has gained increasing interest in the past years; to name a few: (Wager and Athey, 2015), (Athey and Imbens, 2016), ("Learning Representations for Counterfactual Inference"), and (Shalit and Sontag, 2016). (Shalit, Johansson, and Sontag, 2017) and (Johansson, Shalit, and Sontag, 2016a) worked on learning balanced representations, using neural networks both to learn better predictions of the factual outcome and to minimize the error between the factual and counterfactual representations of the unbalanced observational data. Specifically, in (Shalit, Johansson, and Sontag, 2017) the authors built on (Johansson, Shalit, and Sontag, 2016a), focusing on the counterfactual error term and deriving a family of algorithms and metrics in the form of Integral Probability Metrics. For ITE prediction, other work was performed with Gaussian processes (Alaa, Weisz, and Van Der Schaar, 2017) and with decision trees in different approaches (Hill, 2011; Recursive Partitioning for Heterogeneous Causal Effects; Wager and Athey, 2015).

Similarly, (Atan et al., 2016) face the problem of learning from biased data with several features, performing feature selection while predicting among multiple possible actions (outcomes); this is more challenging but models actual industry problems more closely. The authors also remark on the difficulty of learning the features that are relevant for predicting some actions while ignoring them for others. The relevant feature selection was learned by implementing a form of Online Contextual Multi-Armed Bandit (CMAB), likewise in (Tekin and Van Der Schaar, 2018), with some limitations due to the nature of the observational data. Also, (Joachims and Swaminathan, 2016) used IPS estimates and empirical Bernstein inequalities to learn counterfactual outcomes, although they did not work with observational data and did not identify individually important features to perform the task.

In terms of Policy Optimization methods, (Swaminathan and Joachims, 2015a) came up with a Counterfactual Risk Minimization (CRM) method, in which they seek to minimize the Inverse Propensity Score of the units, introducing an algorithm named 'POEM'. After that, (Atan, Zame, and Van Der Schaar, 2018) proposed to address the selection bias by learning representations, working closely related to the domain adaptation bounds in (Ben-David et al., 2007; Blitzer, McDonald, and Pereira, 2006). Additional techniques for policy optimization were developed by (Beygelzimer and Langford, 2008), in which the propensity scores need to be known, solving the selection bias through rejection measurements. The algorithm that (Atan, Zame, and Van Der Schaar, 2018) introduce is based on domain adaptation (DA), as in (Gan et al., 2016). More work in the DA techniques field was done by (Zhang et al., 2013; Daumé, 2009).


To conclude, in cause-and-effect analysis, Time Series data is widely adopted for decision-making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator on structural nested models was introduced by (Lok, 2008). Furthermore, (Causal Reasoning from Longitudinal Data) used Bayesian posterior predictive distributions to solve this time-series causality task. Later, (Reliable Decision Support using Counterfactual Models) introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning due to the nature of the observational data. Retrospective observational data is used for off-policy learning to estimate the best expected reward of a policy that was set beforehand (Dudik, Langford, and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas, and Brunskill, 2017).

2.1 Machine Learning

In this section, the machine learning techniques applied to perform the experiments, implemented with the scikit-learn open source framework, will be described.

The vast majority of the tested methods belong to the family of Generalized Linear Models, in which the target (label) value is represented as a linear combination of the covariates (inputs):

$y(w, x) = w_0 + w_1 x_1 + \dots + w_p x_p \qquad (2.1)$

where the vector $w = (w_1, \dots, w_p)$ represents the coefficients and $w_0$ is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model, the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it.

Mathematically it solves the problem of

$\min_w \lVert Xw - y \rVert_2^2$

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the model produces high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without a design that was previously shaped in an experimental way.
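As a minimal sketch (the toy data below is invented for illustration and is not from the thesis experiments), scikit-learn's LinearRegression recovers the coefficients and intercept of Equation 2.1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: y = x1 + x2 exactly, so OLS can recover w = (1, 1), w0 = 0.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 3.0, 7.0, 7.0])

model = LinearRegression()  # ordinary least squares
model.fit(X, y)

print(model.coef_)       # learned (w_1, w_2), close to [1, 1]
print(model.intercept_)  # learned w_0, close to 0
```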

21 Machine Learning 11

2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. The loss turns into a problem of minimizing the penalized sum of squares:

$\min_w \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2$

It is worth mentioning that the parameter $\alpha \ge 0$ controls the amount of robustness to collinearity that the trained model will have.
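A short sketch of the effect of $\alpha$ on nearly collinear features (synthetic data, illustrative only); the penalized solution has a smaller coefficient norm than the unpenalized one:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
x = rng.randn(50)
# Two nearly collinear columns: the unpenalized solution may be unstable here.
X = np.column_stack([x, x + 1e-6 * rng.randn(50)])
y = x + 0.1 * rng.randn(50)

unpenalized = Ridge(alpha=0.0).fit(X, y)  # equivalent to ordinary least squares
penalized = Ridge(alpha=1.0).fit(X, y)    # alpha controls the l2 penalty

print(np.linalg.norm(unpenalized.coef_))  # possibly large/unstable
print(np.linalg.norm(penalized.coef_))    # shrunk to a sensible scale
```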

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely spread Support Vector Machines for classification, adapted to solve regression problems. During the training phase, the best possible solution is the one that is least penalized in total by a loss function. The support vectors will be the inputs that are either misclassified, classified within enough margin, or on the edge of the generated hyper-plane that splits the dataset for future predictions.

In particular, given training vectors $x_i \in \mathbb{R}^p$, $i = 1, \dots, n$, and a vector $y \in \mathbb{R}^n$, $\varepsilon$-SVR solves the following primal problem:

$\min_{w, b, \zeta, \zeta^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)$

where $C > 0$ is the upper bound and $\zeta_i, \zeta_i^*$ are slack variables. In the corresponding dual, $e$ is the vector of all ones and $Q$ is an $n \times n$ positive semidefinite matrix with $Q_{ij} \equiv K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, where $K$ is the kernel. Here, training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.

The decision function is:

$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + \rho$
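A small sketch with scikit-learn's SVR on synthetic data (invented here for illustration); the RBF kernel corresponds to an implicit infinite-dimensional feature map $\phi$:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + 0.05 * rng.randn(100)  # noisy sine target

# epsilon defines the tube inside which errors are not penalized; C bounds the penalty.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X, y)

print(len(model.support_))  # number of inputs kept as support vectors
print(model.predict([[0.5]]))
```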

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than Linear Regression.

This technique builds a probabilistic model of the regression problem, placing a prior on the parameter $w$ of the general Bayesian Regression solver in the form of a spherical Gaussian:

$p(w \mid \lambda) = \mathcal{N}(w \mid 0, \lambda^{-1} I_p)$

The scikit-learn defaults are used to train the model: $\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 = 10^{-6}$.

During the model fitting process, the parameters $w$, $\alpha$ and $\lambda$ are estimated jointly.
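A minimal sketch on synthetic data (values invented for illustration); with the defaults above, the priors' hyperparameters and the weights are estimated together during `fit`, and the posterior predictive uncertainty is available at prediction time:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = X @ np.array([1.0, 0.5, 0.0]) + 0.1 * rng.randn(200)

# Defaults correspond to alpha_1 = alpha_2 = lambda_1 = lambda_2 = 1e-6.
model = BayesianRidge()
model.fit(X, y)  # w, alpha and lambda are estimated jointly

mean, std = model.predict(X[:3], return_std=True)
print(model.coef_)  # close to [1.0, 0.5, 0.0]
print(std)          # posterior predictive standard deviations
```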


2.1.5 Lasso

Lasso Regression is a linear model fitted with an $\ell_1$ prior as regularizer. Its objective is to minimize:

$\min_w \; \frac{1}{2 n_{\text{samples}}} \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_1$

This method minimizes the least-squares penalty with $\alpha \lVert w \rVert_1$ added, where $\alpha$ is a constant and $\lVert w \rVert_1$ is the $\ell_1$-norm of the parameter vector.

It is important to notice that this algorithm retrieves sparse models, which may be helpful to perform feature selection.
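A short sketch of that sparsity on synthetic data (invented for illustration); the $\ell_1$ penalty drives most coefficients exactly to zero, exposing the informative features:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
# Only features 0 and 3 actually influence the target.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.randn(200)

model = Lasso(alpha=0.1)
model.fit(X, y)

# Indices of surviving (non-zero) coefficients -> implicit feature selection.
print(np.nonzero(model.coef_)[0])
```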

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with $\ell_1$ regularization applied.

The objective function is:

$\min_w \; \frac{1}{2 n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights $w$. It drops the assumption that the Gaussian prior is spherical, making it elliptical instead.

Mathematically:

$p(w \mid \lambda) = \mathcal{N}(w \mid 0, A^{-1})$

with $\text{diag}(A) = \lambda = \{\lambda_1, \dots, \lambda_p\}$.
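A minimal sketch on synthetic data (invented for illustration): with one precision $\lambda_j$ per weight, irrelevant features are pruned toward zero:

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(0)
X = rng.randn(150, 5)
y = X[:, 0] + 0.1 * rng.randn(150)  # only the first feature is informative

model = ARDRegression()
model.fit(X, y)

print(model.coef_)    # approximately [1, 0, 0, 0, 0]
print(model.lambda_)  # per-weight precisions diag(A); large for pruned features
```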

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, these models do not require a learning rate, but they do require a regularization parameter $C$.

They can be used with two different loss functions: PA-I (epsilon insensitive) or PA-II (also known as squared epsilon insensitive).
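A short sketch on noiseless synthetic data (invented for illustration), showing the $C$ parameter and the loss choice that selects PA-I versus PA-II:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveRegressor

rng = np.random.RandomState(0)
X = rng.randn(300, 3)
y = X @ np.array([2.0, -1.0, 0.5])  # noiseless linear target

model = PassiveAggressiveRegressor(
    C=1.0,                          # regularization parameter (no learning rate needed)
    loss="epsilon_insensitive",     # PA-I; "squared_epsilon_insensitive" gives PA-II
    max_iter=1000, tol=1e-4, random_state=0,
)
model.fit(X, y)
print(model.coef_)  # approaches [2.0, -1.0, 0.5]
```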

2.1.9 Theil Sen Regressor

It is especially suited for data with multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems. When this happens, the method becomes similar to Linear Regression with Ordinary Least Squares in high dimension.
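A small sketch of that robustness on synthetic data (invented for illustration): corrupting a fraction of the targets barely moves the Theil-Sen slope estimate:

```python
import numpy as np
from sklearn.linear_model import TheilSenRegressor

rng = np.random.RandomState(0)
X = rng.randn(100, 1)
y = 3.0 * X.ravel() + 0.1 * rng.randn(100)
y[:10] += 30.0  # corrupt 10% of the targets with gross outliers

model = TheilSenRegressor(random_state=0)
model.fit(X, y)
print(model.coef_)  # slope stays near 3 despite the outliers
```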


2.1.10 K-Neighbors Regressor

In this algorithm, the target is predicted from the $n$ nearest neighbors found during the training phase. It is important to notice that $n$ is defined by the user, and it will positively or negatively affect the obtained prediction results.
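A tiny sketch (toy values invented for illustration): the prediction is the mean of the targets of the `n_neighbors` closest training points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

# n_neighbors is the user-chosen n; it trades bias against variance.
model = KNeighborsRegressor(n_neighbors=2)
model.fit(X, y)

# The 2 nearest neighbors of 1.6 are 1.0 and 2.0, so the prediction is their mean.
print(model.predict([[1.6]]))  # -> [1.5]
```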

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can be used for predictions over more than one class using the logistic function.

This scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets and the results willbe discussed in the Experiments section
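As a small sketch of the setup (the iris data here is a stand-in, not the thesis dataset), both solvers used in the experiments can be fitted with the default $\ell_2$ penalty:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # a three-class stand-in dataset

# The two solvers used in the experiments, both with the default l2 penalty.
for solver in ("newton-cg", "lbfgs"):
    model = LogisticRegression(solver=solver, max_iter=1000)
    model.fit(X, y)
    print(solver, round(model.score(X, y), 3))
```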


Chapter 3

Methodology

In this chapter, what was attempted will be detailed. In addition, the methods that were used are explained, as well as any other information necessary to help the reader understand the flow of the experiments covered later.

In addition, the dataset that was used will be presented, closing with a section about other possible datasets and the limitations of the ones used in this dissertation.

Finally, full coverage of the dataset used to perform the experiments will be given.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of the latest efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients, or inputs (in the machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Since the nature of the already-collected observational data is not properly randomized, it does not come from a single shared probability distribution either. Also, the number of units that received the treatment versus the number that did not could potentially differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions, since they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, tested against a control treatment, which in this case would be driving without alcohol consumption, is completely unethical to perform for clear reasons.

To address these limitations when working on causal effects with observational data, synthetic, semi-synthetic, or toy datasets are created by researchers in order to establish a good starting point and a benchmark framework to try, test, or develop better algorithms that can make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions for causal inference. One is the most common to obtain, in which the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In such cases, a policy risk function $\pi$ is designed to apply or not the treatment $t$ depending on a certain threshold $\theta$. Making as few errors as possible when predicting the application of the active treatment or control is the main goal when iterating over different values of the threshold variable for the trained dataset.

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and integration at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

(Hill, 2011) presented a semi-synthetic (also referred to in this work and in the field as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the above paragraph. In (Hill, 2011), some continuous and binary covariates from this real-life RCT were selected. Making use of these covariates, (Hill, 2011) created non-parametric simulated outcomes for the whole population of the trial; 25 covariates of the whole study were taken for this dataset creation. The author then introduced an artificial imbalance between the control and treatment individuals by removing a subset of the treated population. Finally, the dataset comprises 747 subjects (units or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatments $t = 0$ and $t = 1$ given the generalization task that an algorithm must perform.

Along with the covariates, the simulated causal information can be observed for each unit. This is the effectively applied treatment ($t = 0$ or $t = 1$), the observed factual outcome ($y_F$), the counterfactual outcome ($y_{CF}$), and the average outcomes with noise, $\mu_0$ and $\mu_1$.
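A hypothetical sketch of reading one replication of such a dataset. The replication files are assumed here to be `.npz` archives keyed `x`, `t`, `yf`, `ycf`, `mu0`, `mu1`, with the replication index on the last axis; the key names, shapes, and loader below are illustrative and should be verified against the actual downloaded files:

```python
import io
import numpy as np

# Assumed layout: arrays stacked with the replication as the last axis.
def load_replication(npz, rep=0):
    return {k: npz[k][..., rep] for k in ("x", "t", "yf", "ycf", "mu0", "mu1")}

# Demo with a synthetic in-memory archive shaped (units[, covariates], reps).
n, p, reps = 747, 25, 10
rng = np.random.RandomState(0)
fake = {"x": rng.randn(n, p, reps),
        "t": rng.randint(0, 2, (n, reps)).astype(float),
        "yf": rng.randn(n, reps), "ycf": rng.randn(n, reps),
        "mu0": rng.randn(n, reps), "mu1": rng.randn(n, reps)}
buf = io.BytesIO()
np.savez(buf, **fake)
buf.seek(0)

rep0 = load_replication(np.load(buf), rep=0)
print(rep0["x"].shape, rep0["t"].shape)  # (747, 25) (747,)
```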

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE, and PEHE errors, for the reader to look into further if desired. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b), the IHDP dataset (Hill, 2011) was run with 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. These results are not shown in this dissertation, since the response surface chosen differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). In (Johansson, Shalit, and Sontag, 2016b), the BART results were based on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

In a recent publication from the Proceedings of the 10th International Conference on Educational Data Mining, (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) referenced and ran experiments on the IHDP dataset (Hill, 2011). However, the authors do not make explicit the number of replications used to gather the metrics, nor do they state whether log-linear surface A, B, or any other method was used to simulate the semi-synthetic dataset. Consequently, the results they obtained cannot be compared with this dissertation's results, and they are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset was performed, ultimately to predict the factual $y_F$ and counterfactual $y_{CF}$ outcomes for every single unit. Subsequently, those values are input into the programming code produced by (Louizos et al., 2017), in which the $\epsilon_{ITE}$, $\epsilon_{ATE}$ and $\epsilon_{PEHE}$ errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit $x$ with its unique covariates (features).

This is a challenging goal since, as the reader should be clear on by this point, neither the counterfactual outcome $y_{CF}$ nor the average treatment outcomes with noise, $\mu_0$ and $\mu_1$, can be used at all to train the regressor models. Instead, these three values, along with the factual outcome $y_F$, are used only to compute the $\epsilon_{ITE}$, $\epsilon_{ATE}$ and $\epsilon_{PEHE}$ errors.

The experiments were run on 10, 100, and 1000 replications, both within-sample and out-of-sample. The 10, 100, and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions given by (Shalit, Johansson, and Sontag, 2017) are used, this being the same technique later followed by (Louizos et al., 2017) to perform, compare, and show their results.

Within-sample: this refers to all the errors (ITE, ATE, and PEHE) made by the predictions of the already-trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained with an unbalanced dataset (different numbers of samples in which treatments $t = 0$ and $t = 1$ were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for each individual $x \in X$. The other problem to overcome is that, in practice, the population that received treatment $t = 1$ and the population that received $t = 0$ might come from completely different probability distributions. All of these are common problems of observational data and were mentioned earlier in this work.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. In this case, it is naturally harder to make predictions, since the inputs might come from probability distributions even more different from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for $t = 0$ and $t = 1$ are made for each single unit (input) of the testing dataset, to later determine the ITE, ATE, and PEHE errors.

Once the model is trained, it first predicts for each one of the inputs (units) using the treatment value $t = 0$; then the predictions are repeated with all the values of the treatment set to $t = 1$. The difference between these two predictions for each input is known as the ITE, and it ultimately defines whether the patient would benefit from applying the treatment. Mathematically, it is represented by $E[Y_1 - Y_0 \mid x]$.

The machine learning algorithms implemented in Python by (User guide: contents - scikit-learn 0.19.2 documentation) were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. The hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
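The recipe above can be sketched on synthetic stand-in data (not the real IHDP files; the data, the choice of Ridge, and the constant true effect of 4 are all invented for illustration): fit a regressor on $[x, t] \to y_F$, predict both potential outcomes for every unit, and derive the ATE and PEHE errors from the noiseless outcomes:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
n, p = 500, 10
x = rng.randn(n, p)
t = rng.randint(0, 2, size=n).astype(float)
mu0 = x[:, 0]
mu1 = x[:, 0] + 4.0  # true ITE is 4 for every unit (illustrative choice)
yf = np.where(t == 1, mu1, mu0) + 0.1 * rng.randn(n)

# Train on the factual outcomes only, with the treatment as an extra column.
model = Ridge().fit(np.hstack([x, t[:, None]]), yf)
y0 = model.predict(np.hstack([x, np.zeros((n, 1))]))  # all treatments set to 0
y1 = model.predict(np.hstack([x, np.ones((n, 1))]))   # all treatments set to 1

ite_hat = y1 - y0                                     # estimate of E[Y1 - Y0 | x]
eps_ate = abs(ite_hat.mean() - (mu1 - mu0).mean())
eps_pehe = np.sqrt(np.mean((ite_hat - (mu1 - mu0)) ** 2))
print(eps_ate, eps_pehe)
```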

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

Table 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case, it is remarkable that the split between training and testing was performed randomly, only over the training dataset. The intention was to check whether the results


Table 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test as downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

Table 4.3: IHDP 100 replications - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

Table 4.4: IHDP 100 replications - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, the hyperparameter tuning, if any, was performed on this number of replications.

Table 4.5: IHDP 100 replications, already split dataset - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

Table 4.6: IHDP 100 replications, already split dataset - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both the within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with log-linear response setting A generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four tables are shown for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other two were obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results of the methods presented in the following section.
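The scaling step can be sketched as a scikit-learn pipeline (the data and the choice of SVR as the downstream estimator are illustrative stand-ins, not the thesis setup):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.randn(100, 5) * np.array([1.0, 10.0, 100.0, 0.1, 1.0])  # mixed scales
y = X[:, 0] + 0.1 * rng.randn(100)

# MinMaxScaler maps each feature to [0, 1] before the estimator sees it.
model = make_pipeline(MinMaxScaler(), SVR())
model.fit(X, y)

Xt = model.named_steps["minmaxscaler"].transform(X)
print(Xt.min(), Xt.max())  # all features now lie in [0, 1]
```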



Table 4.7: IHDP 1000 replications - No scaling - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

Table 4.8: IHDP 1000 replications - No scaling - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Table 4.9: IHDP 1000 replications - Scaled - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                    4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                            4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                    4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor       5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                 4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor             4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                  4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                   4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                       4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                 4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Consequently, Logistic Regression with a multinomial multi-class setting has been applied. Its performance is well below that of the regressors, mainly because the continuous target values must be encoded into classes before probabilities can be assigned to them, and these classes are not the values that actually need to be predicted; further precision is lost when decoding the predictions back. The l2 penalty has been used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
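A minimal sketch of this encode-fit-decode baseline is shown below; the binning scheme and the synthetic data are illustrative, not the exact procedure used in the experiments. With these solvers, scikit-learn fits a multinomial model over the encoded classes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = X[:, 0] + 0.1 * rng.normal(size=300)       # continuous outcome

edges = np.linspace(y.min(), y.max(), 11)      # encode: 10 equal-width classes
y_cls = np.digitize(y, edges[1:-1])
centres = 0.5 * (edges[:-1] + edges[1:])       # decode targets (precision is lost here)

for solver in ("newton-cg", "lbfgs"):          # the two solvers compared above
    clf = LogisticRegression(penalty="l2", solver=solver, max_iter=1000)
    clf.fit(X, y_cls)
    y_dec = centres[clf.predict(X)]            # decode class labels back to values
    print(solver, round(float(np.abs(y_dec - y).mean()), 3))
```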

TABLE 4.11: IHDP 100 replications, logistic regressions - Within sample

Method                                εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)   7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)       7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                                εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)   5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)       5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


From all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few rounds of hyper-parameter tuning were run. The observed errors were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel with C=1e3 and gamma=0.01. The selection was performed both within-sample and out-of-sample, but on 100 replications of the dataset; this is the same procedure the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyper-parameter search.
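The selected configuration can be sketched as follows, fitted here on synthetic stand-in data since the IHDP loading code is not shown in this section.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 25))
y = np.sin(3 * X[:, 0]) + 0.05 * rng.normal(size=200)

model = SVR(kernel="rbf", C=1e3, gamma=0.01)   # the selected hyper-parameters
model.fit(X, y)
pred = model.predict(X)
print(pred[:3].round(2))
```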

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within sample

Method                        εITE           εATE           √εPEHE
SVR rbf C=1e3 gamma=0.1       3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR rbf C=1e3 gamma=0.05      2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR rbf C=1e3 gamma=0.01      2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR rbf C=1e3 gamma=0.001     3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR rbf C=1e3 gamma=0.0001    4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR rbf C=1e3 gamma=0.00001   4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR rbf C=1e10 gamma=0.1      3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR rbf C=1e20 gamma=0.1      3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR rbf C=1e30 gamma=0.1      3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR poly C=1e3 degree=2       2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR poly C=1e3 degree=1       2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR poly C=1e3 degree=4       2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR poly C=1e10 degree=2      2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

Method                        εITE           εATE           √εPEHE
SVR rbf C=1e3 gamma=0.1       2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR rbf C=1e3 gamma=0.05      2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR rbf C=1e3 gamma=0.01      2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR rbf C=1e3 gamma=0.001     3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR rbf C=1e3 gamma=0.0001    4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR rbf C=1e3 gamma=0.00001   4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR rbf C=1e10 gamma=0.1      2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR rbf C=1e20 gamma=0.1      2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR rbf C=1e30 gamma=0.1      2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR poly C=1e3 degree=2       2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR poly C=1e3 degree=1       2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR poly C=1e3 degree=4       2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR poly C=1e10 degree=2      3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39

Finally, the results obtained by the experiments run for this thesis are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results reported in (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method      √εPEHE        εATE
OLS/LR-1    5.8 ± 0.3     0.73 ± 0.04
OLS/LR-2    2.4 ± 0.1     0.14 ± 0.01
BLR         5.8 ± 0.3     0.72 ± 0.04
k-NN        2.1 ± 0.1     0.14 ± 0.01
TMLE        5.0 ± 0.2     0.30 ± 0.01
BART        2.1 ± 0.1     0.23 ± 0.01
RANDFOR     4.2 ± 0.2     0.73 ± 0.05
CAUSFOR     3.8 ± 0.2     0.18 ± 0.01
BNN         2.2 ± 0.1     0.37 ± 0.03
TARNET      0.88 ± 0.0    0.26 ± 0.01
CFR MMD     0.73 ± 0.0    0.30 ± 0.01
CFR WASS    0.71 ± 0.0    0.25 ± 0.01

(Within-sample, IHDP, 1000 replications.)

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method      √εPEHE        εATE
OLS/LR-1    5.8 ± 0.3     0.94 ± 0.06
OLS/LR-2    2.5 ± 0.1     0.31 ± 0.02
BLR         5.8 ± 0.3     0.93 ± 0.05
k-NN        4.1 ± 0.2     0.79 ± 0.05
BART        2.3 ± 0.1     0.34 ± 0.02
RANDFOR     6.6 ± 0.3     0.96 ± 0.06
CAUSFOR     3.8 ± 0.2     0.40 ± 0.03
BNN         2.1 ± 0.1     0.42 ± 0.03
TARNET      0.95 ± 0.0    0.28 ± 0.01
CFR MMD     0.78 ± 0.0    0.31 ± 0.01
CFR WASS    0.76 ± 0.0    0.27 ± 0.01

(Out-of-sample, IHDP, 1000 replications.)


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though powerful feature-selection publications were presented in the Related Work section, an experiment was performed in the developed code that applies machine learning Recursive Feature Elimination (RFE) using the scikit-learn framework.

Strictly speaking, under the Strong Ignorability assumption made for the studied dataset such an experiment would not be appropriate; however, since the machine learning regressors are sensitive to highly correlated input features, eliminating some features might relieve part of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; they are therefore not shown here, but the reader can reproduce them from the code implementation for further analysis.
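For reference, such an RFE experiment can be sketched with scikit-learn as follows; the data and feature counts are illustrative, not the thesis's actual configuration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 20))
y = X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.normal(size=150)   # only 2 informative features

# RFE repeatedly fits the estimator and drops the weakest-coefficient feature
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X, y)
print(np.flatnonzero(rfe.support_))   # indices of the retained covariates
```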

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for these 10 replications; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel dual-core i7.

Due to their architectural design, Domain Adaptation algorithms are a promising direction for exploring ITE and ATE predictions.

TABLE 4.17: Domain Adaptation Neural Networks

Method                 εITE           εATE           √εPEHE
DANN (Within-sample)   1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)   1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

(Within-sample and out-of-sample, IHDP, 10 replications.)

4.3 Discussion

As can be clearly noticed, the machine learning regression algorithms applied in this dissertation achieve results very close to those obtained in the work published by the cited authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the unbalanced treatment assignment) nor any other custom loss function was needed to obtain the results shown in Table 4.9 and Table 4.10.
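For concreteness, the reported quantities can be computed directly from predicted and true (simulated) potential outcomes; the arrays below are synthetic stand-ins for the IHDP noiseless responses, with `mu0_hat`/`mu1_hat` playing the role of a regressor's factual and counterfactual predictions.

```python
import numpy as np

rng = np.random.default_rng(4)
mu0, mu1 = rng.normal(size=(2, 500))        # true noiseless potential outcomes
mu0_hat = mu0 + 0.1 * rng.normal(size=500)  # predicted outcome under control
mu1_hat = mu1 + 0.1 * rng.normal(size=500)  # predicted outcome under treatment

ite_true = mu1 - mu0                         # individual treatment effects
ite_hat = mu1_hat - mu0_hat

eps_ate = abs(ite_hat.mean() - ite_true.mean())           # εATE
sqrt_pehe = np.sqrt(np.mean((ite_hat - ite_true) ** 2))   # √εPEHE
print(round(float(eps_ate), 3), round(float(sqrt_pehe), 3))
```

Note that √εPEHE always upper-bounds εATE, since the absolute mean error cannot exceed the root-mean-squared error.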


It seems that an excessive amount of effort from the authors leads to complicated methods that do not gain much more in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the training and testing errors over 10 replications of the Domain Adaptation Neural Networks showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is one of the must-do tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, although its framing changed once the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in the last few years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, continuous-space causal problems on observational data with more than two possible outcomes are substantially better suited to Reinforcement Learning algorithms than to any other deep neural network or regressor.

Finally, this work is intended to cover a considerable gap by providing straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow when relating terms from the causal inference field to researchers with a computer and data science background. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future work on this topic should take several directions.

First, it would be important to apply the studied machine learning methods to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome should be extended to multi-valued treatments. To the best of my knowledge, this modification should not be costly to perform in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to binary factual predictions and to a Policy Risk threshold, deciding whether a treatment should be applied or not, would be an important next step in causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it showed in the experiment run here.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied leads to time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers working on machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. "Causal Reasoning from Longitudinal Data". DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06). Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames", pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6

Rosenbaum, Paul R. (2002). "Observational Studies", pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1


Rosenbaum, Paul R. and Donald B. Rubin (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects". Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. "Causal Inference Using Potential Outcomes: Design, Modeling, Decisions". DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

– (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064

– (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for "Estimating individual treatment effect: generalization bounds and algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Complete Draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 2018-08-17)

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.



To conclude, in cause-and-effect analysis Time Series data is widely adopted for decision-making support. The main challenge in the continuous time space is to properly gather feedback from the outcomes to help determine a future decision (treatment). (Robins, 1986) was the first to learn and optimize decisions throughout time, accounting for the possible actions. Through the integration of action-value functions (Nahum-Shani et al., 2012), algorithms can learn rules to make decisions along time-steps. An estimator on structural nested models was introduced by (Lok, 2008). Furthermore, (Arjas and Parner, "Causal Reasoning from Longitudinal Data") used Bayesian posterior predictive distributions to solve this time-series causality task. Later, (Schulam and Saria, "Reliable Decision Support using Counterfactual Models") introduced the 'Counterfactual Gaussian Process' to predict the counterfactual future progression of continuous-time trajectories under sequences of future actions, implementing a Reinforcement Learning approach (Sutton and Barto, 2017) with off-policy learning, due to the nature of the observational data. Retrospective observational data is used in off-policy learning to estimate the best expected reward of a policy that is set beforehand (Dudik, Langford, and Li, 2011; Swaminathan and Joachims, 2015a; Jiang and Li, 2015; Paduraru et al., 2012; Doroudi, Thomas, and Brunskill, 2017).

2.1 Machine Learning

This section describes the machine learning techniques, applied through the scikit-learn open-source framework, that were used to perform the experiments.

The vast majority of the methods tested belong to the family of Generalized Linear Models: they represent the target (label) value as a linear combination of the covariates (inputs),

y(w, x) = w0 + w1 x1 + ... + wp xp    (2.1)

where the vector w = (w1, ..., wp) represents the coefficients and w0 is the intercept.

2.1.1 Ordinary Least Squares (Linear Regression)

In this model the objective is to minimize the residual sum of squares between the observed targets in the dataset and the predictions made on it.

Mathematically, it solves the problem:

min_w ||Xw − y||_2^2

The main limitation of this method is that if the features (covariates) have an approximate linear dependence, the model produces high variance and is therefore more sensitive to random errors in the prediction. This limitation especially affects data collected without an experimental design.
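A small illustration of this sensitivity, using two nearly identical synthetic features; the data and noise levels are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
x = rng.normal(size=200)
X = np.column_stack([x, x + 1e-6 * rng.normal(size=200)])  # near-duplicate feature
y = x + 0.01 * rng.normal(size=200)

ols = LinearRegression().fit(X, y)
print(ols.coef_)   # individually unstable coefficients, though their sum stays near 1
```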

21 Machine Learning 11

2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the Linear Regression method mentioned above by penalizing the size of the coefficients. The loss turns into a penalized sum-of-squares minimization problem:

min_w ||Xw − y||_2^2 + α ||w||_2^2

It is worth mentioning that the parameter α ≥ 0 controls the amount of robustness to collinearity that the trained model will have.
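A short sketch on deliberately collinear synthetic data: the α-weighted penalty keeps the Ridge coefficients small and stable (the alpha value here is illustrative).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
x = rng.normal(size=200)
X = np.column_stack([x, x + 1e-6 * rng.normal(size=200)])  # collinear pair
y = x + 0.01 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # alpha >= 0 sets the shrinkage strength
print(ridge.coef_)                   # both coefficients stay small, near 0.5 each
```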

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely spread Support Vector Machines for classification, adapted to solve regression problems. During the training phase, the best possible solution is the one that is least penalized in total by a loss function. The support vectors are the inputs that are either misclassified, classified within too small a margin, or lie on the edge of the generated hyper-plane that splits the dataset for future predictions.

In particular, given training vectors x_i ∈ R^p, i = 1, ..., n, and a vector y ∈ R^n, ε-SVR solves the following primal problem:

min_{w,b,ζ,ζ*}  (1/2) w^T w + C Σ_{i=1}^{n} (ζ_i + ζ_i*)

where e is the vector of all ones, C > 0 is the upper bound, and Q is an n-by-n positive semidefinite matrix with Q_ij ≡ K(x_i, x_j) = φ(x_i)^T φ(x_j) the kernel. Here training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function φ.

The decision function is:

Σ_{i=1}^{n} (α_i − α_i*) K(x_i, x) + ρ

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than Linear Regression.

This technique builds a probabilistic model of the regression problem, where the prior for the parameter w of the general Bayesian Regression solver is a spherical Gaussian:

p(w|λ) = N(w | 0, λ^(-1) I_p)

The scikit-learn defaults are used to train the model: α1 = α2 = λ1 = λ2 = 10^(-6).

During the fitting process, the parameters w, α, and λ are estimated jointly.
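A minimal sketch using the defaults quoted above, on synthetic data; a useful by-product of the probabilistic model is a predictive standard deviation alongside the mean.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(8)
X = rng.normal(size=(150, 4))
y = X @ np.array([0.5, 1.0, -1.0, 0.0]) + 0.1 * rng.normal(size=150)

model = BayesianRidge(alpha_1=1e-6, alpha_2=1e-6, lambda_1=1e-6, lambda_2=1e-6)
model.fit(X, y)                                    # w, alpha and lambda estimated jointly
mean, std = model.predict(X[:3], return_std=True)  # posterior predictive mean and std
print(mean.round(2), std.round(2))
```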


2.1.5 Lasso

Lasso Regression is a linear model fitted with an `1 prior as regularizer. Its objective is to minimize:

min_w  (1/(2 n_samples)) ||Xw − y||_2^2 + α ||w||_1

This method thus minimizes the least-squares penalty with α||w||_1 added, where α is a constant and ||w||_1 is the `1-norm of the parameter vector.

It is important to notice that this algorithm retrieves sparse models, which may be helpful for feature selection.
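A short illustration of this sparsity on synthetic data with only two informative covariates (the alpha value is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
X = rng.normal(size=(200, 15))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(lasso.coef_))   # mostly just the two informative features survive
```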

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with L1 regularization applied.

The objective function is:

(1/(2 n_samples)) ||y − Xw||_2^2 + α ||w||_1

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights w.

It drops the assumption of the Gaussian being spherical, allowing it to be elliptical. Mathematically:

p(w|λ) = N(w | 0, A^(-1))

with diag(A) = λ = {λ1, ..., λp}.

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, this method does not require a learning rate, but it does require a regularization parameter C.

It can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (squared epsilon-insensitive).

2.1.9 Theil Sen Regressor

It is especially suited to multivariate outliers, but its efficiency decreases dramatically when it tries to solve a high-dimensional problem. When this happens, the method becomes similar to Linear Regression with Ordinary Least Squares in high dimension.


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the n nearest neighbors found during the training phase. It is important to notice that n is defined by the user, and it will affect, positively or negatively, the quality of the obtained predictions.
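A tiny sketch of how the user-chosen n drives the prediction (the data points here are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])

# The prediction is the average target of the n_neighbors closest training points.
model = KNeighborsRegressor(n_neighbors=2)
model.fit(X, y)

# The two nearest neighbors of 1.6 are 2.0 and 1.0, so the prediction
# averages their targets.
print(model.predict([[1.6]]))  # [1.5]
```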

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can also be applied to predictions over more than two classes using the logistic (log) function.

This scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets and the results willbe discussed in the Experiments section
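A short sketch of the two L2-regularized solver configurations used later in the experiments; the dataset here is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)

# L2-regularized logistic regression; newton-cg and lbfgs both handle the
# multinomial (more than two classes) case.
for solver in ("newton-cg", "lbfgs"):
    clf = LogisticRegression(penalty="l2", solver=solver, max_iter=500)
    clf.fit(X, y)
    print(solver, round(clf.score(X, y), 2))
```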


Chapter 3

Methodology

This chapter details what this work set out to achieve. The methods that were used are explained, together with any other information needed to help the reader follow the experiments covered later.

In addition, the dataset that was used is presented, closing with a section about other possible datasets that could be applied and possible limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in full detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project - and, to some extent, of recent efforts in machine learning applied to causality - is to make accurate predictions on a set of units, patients, or inputs (in machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Since such observational data has not been randomized properly, the treated and control groups do not necessarily come from the same probability distribution. Also, the number of units that received the treatment versus the number that did not could differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions, since they would be unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, would be completely unethical for clear reasons.

To overcome these limitations when working on causal effects from observational data, synthetic, semi-synthetic, or toy datasets are created by researchers in order to establish a good starting point and a benchmark framework to try, test, or develop better algorithms able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions in causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In such cases a policy risk function π is designed to apply or not apply the treatment t depending on a certain threshold θ. Making as few errors as possible when deciding between the active treatment and the control is the main goal when iterating over different values of the threshold variable for the trained dataset.
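A minimal sketch of such a threshold policy, where treatment is assigned only when the predicted effect clears θ; the function name and values are illustrative:

```python
import numpy as np

def policy(ite_hat, theta=0.0):
    """Treat unit i (return 1) iff its predicted treatment effect exceeds theta."""
    return (ite_hat > theta).astype(int)

# Predicted individual treatment effects for four hypothetical units.
ite_hat = np.array([0.8, -0.3, 0.1, -2.0])
print(policy(ite_hat, theta=0.0))   # [1 0 1 0]
print(policy(ite_hat, theta=0.5))   # [1 0 0 0]
```

Sweeping θ and measuring the errors of the induced decisions is what the policy risk evaluation iterates over.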

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and integration at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

However, (Hill, 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the above paragraph. In (Hill, 2011), some continuous and binary covariates from this real-life RCT were selected. Making use of these covariates, (Hill, 2011) created simulated outcomes, generating non-parametric simulated outcomes for the whole population of the trial; 25 covariates of the whole study were taken for this dataset creation. The author then introduced an artificial imbalance between control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units, or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of treatment t = 0 or t = 1 based on the generalization that an algorithm can perform.

Along with the covariates, the simulated causal information can be observed for each unit: the treatment effectively applied (t = 0 or t = 1), the observed outcome (y_f), the counterfactual outcome (y_cf), and the average outcome surfaces mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE, and PEHE errors, for the reader to look into further if desired. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b), the IHDP dataset (Hill, 2011) was run on 100 replications to perform hyperparameter tuning and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. These results are not shown in this dissertation, since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). For the BART results, (Johansson, Shalit, and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also referenced and ran experiments on the IHDP dataset (Hill, 2011). However, the authors do not state the number of replications used to gather the metrics, nor whether a log-linear "A" or "B" (or any other) method was used to simulate the semi-synthetic dataset. Consequently, the results they obtained cannot be compared with those of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset were performed to ultimately predict the factual yF and counterfactual yCF outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE, and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome yCF nor the outcome surfaces mu0, mu1 can be used at all to train the regressor models. Instead, these values, along with the factual outcome yF, are only used to compute the εITE, εATE, and εPEHE errors.
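A sketch of how these evaluation errors can be computed once both potential outcomes have been predicted; the formulas follow the usual IHDP-style evaluation, and the numbers are made up for illustration:

```python
import numpy as np

def eval_errors(mu0, mu1, y0_hat, y1_hat):
    """Absolute ATE error and root-PEHE between true and predicted effects."""
    ite_true = mu1 - mu0
    ite_hat = y1_hat - y0_hat
    eps_ate = abs(ite_true.mean() - ite_hat.mean())
    eps_pehe = np.sqrt(np.mean((ite_true - ite_hat) ** 2))
    return eps_ate, eps_pehe

mu0 = np.array([1.0, 2.0])
mu1 = np.array([3.0, 5.0])        # true effects: [2.0, 3.0]
y0_hat = np.array([1.5, 2.5])
y1_hat = np.array([3.5, 4.5])     # predicted effects: [2.0, 2.0]

eps_ate, eps_pehe = eval_errors(mu0, mu1, y0_hat, y1_hat)
print(round(eps_ate, 3), round(eps_pehe, 3))  # 0.5 0.707
```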

The experiments were run on 10, 100, and 1000 replications, both within-sample and out-of-sample. The 10, 100, and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same ones used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson, and Sontag, 2017), the same technique later followed by (Louizos et al., 2017) to perform, compare, and show their results.

Within-sample: this refers to all the errors (ITE, ATE, and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (a different number of samples for which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data, as discussed earlier in this work.


Out-of-sample: these predictions are made on completely unseen new units, outside the training or validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even from the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for every single unit (input) of the testing dataset, to later determine the ITE, ATE, and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using treatment value t = 0, and then the predictions are repeated with all values of the treatment set to t = 1. The difference between these two predictions for each input is known as the ITE, and it ultimately defines whether the patient would benefit from applying the treatment. Mathematically it is represented by E[Y1 − Y0 | x].

The machine learning algorithms implemented in Python code by (User guide: contents — scikit-learn 0.19.2 documentation) were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. Hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
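The procedure just described can be sketched as follows. The treatment is assumed here to be appended as an extra input column, and the generative model is a synthetic stand-in for an IHDP replication in which the true effect is 2 for every unit (the SVR hyperparameters mirror the ones selected later):

```python
import numpy as np
from sklearn.svm import SVR

def predict_ite(model, X):
    """E[Y1 - Y0 | x]: predict with t forced to 1, then to 0, and subtract."""
    n = X.shape[0]
    X1 = np.hstack([X, np.ones((n, 1))])   # everyone treated
    X0 = np.hstack([X, np.zeros((n, 1))])  # everyone untreated
    return model.predict(X1) - model.predict(X0)

rng = np.random.RandomState(0)
X = rng.randn(300, 25)                        # 25 covariates, as in IHDP
t = rng.binomial(1, 0.3, size=300)            # unbalanced binary treatment
y = X[:, 0] + 2.0 * t + 0.1 * rng.randn(300)  # only factual outcomes observed

model = SVR(kernel="rbf", C=1e3, gamma=0.01)
model.fit(np.hstack([X, t[:, None]]), y)

ite_hat = predict_ite(model, X)
# A positive average predicted effect is expected under this toy model.
print(ite_hat.shape, ite_hat.mean() > 0)
```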

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14   0.94 ± 0.35   2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99   0.97 ± 0.67   4.80 ± 2.80
LassoLars                         4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
Lasso                             4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
ARDRegression                     3.92 ± 2.01   0.97 ± 0.74   4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09   1.54 ± 1.07   4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99   0.89 ± 0.63   4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
LinearRegression                  3.92 ± 2.01   0.89 ± 0.65   4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is remarkable that the split between training and testing was performed randomly, over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63   1.24 ± 0.62   2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67   1.82 ± 1.35   4.13 ± 2.23
LassoLars                         4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
Lasso                             4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
ARDRegression                     3.57 ± 1.70   1.83 ± 1.41   4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94   2.38 ± 1.75   4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68   1.76 ± 1.30   4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
LinearRegression                  3.58 ± 1.70   1.77 ± 1.33   4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60   0.86 ± 0.16   5.65 ± 0.83
LassoLars                         4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
Lasso                             4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
ARDRegression                     4.49 ± 0.60   0.81 ± 0.16   5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75   0.83 ± 0.14   5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59   0.79 ± 0.15   5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
LinearRegression                  4.53 ± 0.60   0.79 ± 0.16   5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54   1.02 ± 0.26   5.37 ± 0.78
LassoLars                         4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
Lasso                             4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
ARDRegression                     4.27 ± 0.54   1.00 ± 0.26   5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69   1.00 ± 0.21   5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53   0.99 ± 0.25   5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
LinearRegression                  4.31 ± 0.55   0.99 ± 0.26   5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the already split training and test sets obtained from (Johansson, 2017 (accessed July 19, 2018)), which are the exact same datasets used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, hyperparameter tuning, when performed, was done on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38   0.76 ± 0.08   3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58   0.80 ± 0.14   5.61 ± 0.83
LassoLars                         4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
Lasso                             4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
ARDRegression                     4.45 ± 0.59   0.77 ± 0.15   5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62   0.83 ± 0.13   5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57   0.72 ± 0.13   5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
LinearRegression                  4.48 ± 0.59   0.75 ± 0.14   5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28   0.73 ± 0.07   3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57   0.81 ± 0.11   5.74 ± 0.89
LassoLars                         4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
Lasso                             4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
ARDRegression                     4.42 ± 0.58   0.78 ± 0.11   5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59   1.01 ± 0.17   5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56   0.85 ± 0.13   5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
LinearRegression                  4.45 ± 0.58   0.78 ± 0.11   5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with log-linear response setting A generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four different tables are presented for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results of the methods presented in the following section.
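The scaling step can be sketched as follows; note that the scaler is fit on the training split only and reused on the test split, so test values may fall outside [0, 1] (the toy matrices below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 10.0],
                    [3.0, 30.0],
                    [2.0, 20.0]])
X_test = np.array([[2.0, 40.0]])

scaler = MinMaxScaler()              # maps each training feature onto [0, 1]
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)  # reuse the training min/max: no leakage

print(X_train_s.min(), X_train_s.max())  # 0.0 1.0
print(X_test_s)                          # [[0.5 1.5]]
```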



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12   0.69 ± 0.02   3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19   0.78 ± 0.04   5.81 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22   0.90 ± 0.05   5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18   0.70 ± 0.03   5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09   0.78 ± 0.03   3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19   0.98 ± 0.05   5.79 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22   1.13 ± 0.07   5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08   0.33 ± 0.02   2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19   0.73 ± 0.04   5.80 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22   1.02 ± 0.06   5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.69 ± 0.03   5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20   1.64 ± 0.09   6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18   0.67 ± 0.03   5.79 ± 0.25
ElasticNet                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is well below the regressors, the main reason being that when encoding the target values to assign them probabilities, these are not the same values that need to be predicted; precision is also lost when decoding the predictions. The ℓ2 norm was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

                                      εITE          εATE          √εPEHE
LogisticRegression - L2 (NEWTON-CG)   7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)       7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                      εITE          εATE          √εPEHE
LogisticRegression - L2 (NEWTON-CG)   5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)       5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyperparameter tuning were done. The errors observed were even smaller, so the final hyperparameters selected for this dataset were a Radial Basis Function (rbf) kernel, C = 1e3, and gamma = 0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use


for their own hyperparameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyperparameter selection, with the final results highlighted.
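The kind of search reported in the following tables can be sketched with a small cross-validated grid; the data here is a toy stand-in, and the grid values mirror the ones tried above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.randn(150, 5)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(150)

# Grid over gamma at fixed C = 1e3, scored by 3-fold cross-validated R^2.
param_grid = {"C": [1e3], "gamma": [0.1, 0.05, 0.01, 0.001]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_["gamma"] in param_grid["gamma"])  # True
```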

TABLE 4.13: IHDP 100 replications, SVR hyperparameter tuning - Within-sample

                          εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyperparameter tuning - Out-of-sample

                          εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39

Finally, the results obtained by the experiments run in this thesis are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results obtained in (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .73 ± .04
OLS/LR-2    2.4 ± .1    .14 ± .01
BLR         5.8 ± .3    .72 ± .04
k-NN        2.1 ± .1    .14 ± .01
TMLE        5.0 ± .2    .30 ± .01
BART        2.1 ± .1    .23 ± .01
RANDFOR     4.2 ± .2    .73 ± .05
CAUSFOR     3.8 ± .2    .18 ± .01
BNN         2.2 ± .1    .37 ± .03
TARNET      .88 ± .0    .26 ± .01
CFR MMD     .73 ± .0    .30 ± .01
CFR WASS    .71 ± .0    .25 ± .01

Within-sample, IHDP 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .94 ± .06
OLS/LR-2    2.5 ± .1    .31 ± .02
BLR         5.8 ± .3    .93 ± .05
k-NN        4.1 ± .2    .79 ± .05
BART        2.3 ± .1    .34 ± .02
RANDFOR     6.6 ± .3    .96 ± .06
CAUSFOR     3.8 ± .2    .40 ± .03
BNN         2.1 ± .1    .42 ± .03
TARNET      .95 ± .0    .28 ± .01
CFR MMD     .78 ± .0    .31 ± .01
CFR WASS    .76 ± .0    .27 ± .01

Out-of-sample, IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section presented publications implementing powerful feature selection methods, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn framework.

In addition, assuming Strong Ignorability on the studied dataset, it would not be appropriate to perform such an experiment; but given the nature of the machine learning regressors and their sensitivity to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse across all algorithms for the causal inference metrics detailed in this work; thus they are not shown, but they can be reviewed by the reader in the code implementation for further analysis.

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications; the code uploaded with this work contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days with an Intel Dual Core i7.

Domain Adaptation algorithms are a promising field for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

                       εITE          εATE          √εPEHE
DANN (Within-sample)   1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)   1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and out-of-sample, IHDP 10 replications

4.3 Discussion

As can clearly be noticed, the machine learning regressor algorithms applied in this dissertation come very close to the results published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treatment assignment, nor any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.


It seems that an excessive amount of effort from the authors leads to complicated methods without gaining much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the 10-replication Domain Adaptation Neural Network training and testing errors showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying machine learning regressors out-of-the-box, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state-of-the-art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except scaling the features, which is part of the must-do tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, and it evolved once the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, continuous-space causal problems on observational data that include more than two possible outcomes are substantially more suitable for Reinforcement Learning algorithms than for any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow when relating terms from the causal inference field for researchers with a computer and data science background. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform this modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold deciding whether a treatment should be applied, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing with this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due to both the architectural design of the algorithm and the state-of-the-art precision it achieved in the experiments run.

Lastly, the application of these methods to causal datasets whose outcomes vary with time under the applied treatments falls within time-series problems in the continuous space. These kinds of datasets will possibly be the next focus of researchers applying machine learning to treatments applied over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf.

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822.

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5.

Atan, Onur, et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–7360. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149.

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf.

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786.

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". In: Biometrics. DOI: 10.1111/j.1541-0420.2005.00377.x.

Ben-David, Shai, et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation.

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044.

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06). Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094.

Bottou, Léon, et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf.


Chernozhukov, Victor, et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060.

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815.

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci.

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf.

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601.

Gan, Chuang, et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985–1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O., et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803.

Johansson, Fredrik D. (2017 (accessed July 19, 2018)). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www.


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571.

Louizos, Christos, et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H., et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247.

Mooij, Joris M., et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal, et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref.

Paduraru, Cosmin, et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9–12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for "Estimating individual treatment effect: generalization bounds and algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". In: URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (complete draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu, et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun, et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.



2.1.2 Ridge Regression

Ridge regression addresses some of the limitations of the above-mentioned Linear Regression method by penalizing the size of the coefficients. The loss thus becomes a problem of minimizing a penalized sum of squares:

$$\min_w \; \|Xw - y\|_2^2 + \alpha \|w\|_2^2$$

It is worth mentioning that the parameter $\alpha \ge 0$ controls the amount of robustness to collinearity that the trained model will have.
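As a minimal sketch of how such a model is fit with scikit-learn (the data here is synthetic and purely illustrative, not the IHDP dataset):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.randn(100)

# alpha >= 0 controls the strength of the L2 penalty on the coefficients
model = Ridge(alpha=1.0)
model.fit(X, y)
y_pred = model.predict(X)
```

Larger values of alpha shrink the coefficients further, trading a little bias for robustness to collinearity.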

2.1.3 Support Vector Regressor

A Support Vector Regressor (SVR) is an extension of the widely used Support Vector Machines for classification to regression problems. During the training phase, the best possible solution is the one least penalized in total by a loss function. The support vectors are the inputs that are either misclassified, classified within the margin, or lying on the edge of the generated hyperplane that splits the dataset for future predictions.

In particular, an SVR takes training vectors $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, and a vector $y \in \mathbb{R}^n$. $\varepsilon$-SVR solves the following primal problem:

$$\min_{w, b, \zeta, \zeta^*} \; \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)$$

where $C > 0$ is the upper bound and $Q$ is an $n \times n$ positive semidefinite matrix with $Q_{ij} \equiv K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ the kernel. Here, training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function $\phi$.

The decision function is:

$$\sum_{i=1}^{n} (\alpha_i - \alpha_i^*) K(x_i, x) + \rho$$
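A brief illustrative sketch with scikit-learn's SVR (synthetic data; the kernel and parameter values are arbitrary choices, not those used in the experiments):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = np.sin(X[:, 0]) + 0.05 * rng.randn(200)

# C is the penalty parameter; epsilon sets the width of the insensitive tube
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svr.fit(X, y)
pred = svr.predict(X[:5])
```

With an RBF kernel the mapping $\phi$ is implicit, so the model is fit entirely through the kernel values $K(x_i, x_j)$.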

2.1.4 Bayesian Ridge

Bayesian Ridge Regression is more robust to ill-posed problems than Linear Regression.

This technique builds a probabilistic model of the regression problem, giving the parameter $w$ of the general Bayesian Regression solver a spherical Gaussian prior:

$$p(w|\lambda) = \mathcal{N}(w|0, \lambda^{-1} I_p)$$

The scikit-learn defaults are used to train the model: $\alpha_1 = \alpha_2 = \lambda_1 = \lambda_2 = 10^{-6}$.

During the model fitting process, the parameters $w$, $\alpha$, and $\lambda$ are estimated jointly.
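A minimal sketch using scikit-learn's BayesianRidge with the default hyperpriors mentioned above (synthetic data, purely illustrative):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.RandomState(0)
X = rng.randn(100, 4)
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.1 * rng.randn(100)

# Defaults correspond to alpha_1 = alpha_2 = lambda_1 = lambda_2 = 1e-6
br = BayesianRidge()
br.fit(X, y)
mean, std = br.predict(X[:3], return_std=True)  # predictive mean and std
```

Unlike plain Ridge, the probabilistic formulation also yields a predictive standard deviation for each prediction.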


2.1.5 Lasso

Lasso Regression is a linear model fitted with an $\ell_1$ prior as regularizer. Its objective is to minimize:

$$\min_w \; \frac{1}{2 n_{\text{samples}}} \|Xw - y\|_2^2 + \alpha \|w\|_1$$

This method minimizes the least-squares loss with the penalty $\alpha \|w\|_1$ added, where $\alpha$ is a constant and $\|w\|_1$ is the $\ell_1$-norm of the parameter vector.

It is important to note that this algorithm yields sparse models, which may be helpful for performing feature selection.
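The sparsity property can be illustrated with a small sketch (synthetic data; the value of alpha is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
# Only the first two features actually drive the outcome
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
selected = np.flatnonzero(lasso.coef_)  # indices of non-zero coefficients
```

The $\ell_1$ penalty drives the coefficients of irrelevant features exactly to zero, so the surviving indices act as a feature-selection result.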

2.1.6 Lasso Lars

This model is trained with the Least Angle Regression (LARS) algorithm, with L1 regularization applied.

The objective function is:

$$\min_w \; \frac{1}{2 n_{\text{samples}}} \|y - Xw\|_2^2 + \alpha \|w\|_1$$

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it may lead to sparser weights $w$. It drops the assumption of the Gaussian being spherical, allowing it to be elliptical.

Mathematically:

$$p(w|\lambda) = \mathcal{N}(w|0, A^{-1})$$

with $\text{diag}(A) = \lambda = \{\lambda_1, \ldots, \lambda_p\}$.

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, Passive Aggressive regressors do not require a learning rate, but they do require a regularization parameter $C$.

They can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (squared epsilon-insensitive).

2.1.9 Theil Sen Regressor

It is especially suited to handling multivariate outliers, but its efficiency decreases dramatically on high-dimensional problems. In that regime the method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the $n$ nearest neighbors found during the training phase. Note that $n$ is defined by the user and will positively or negatively affect the quality of the obtained predictions.
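A short sketch showing the user-chosen n_neighbors parameter (synthetic one-dimensional data, illustrative only):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel()

# n_neighbors is user-chosen and trades off bias against variance
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)
pred = knn.predict([[2.5]])
```

A small n_neighbors fits local structure closely (low bias, high variance), while a large value smooths predictions toward the global mean.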

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, and it extends to predictions over more than one class through the logistic (log-odds) function.

The scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets, and the results are discussed in the Experiments chapter.
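A minimal sketch of fitting a binary logistic regression with an explicit solver and penalty (synthetic data; this solver/penalty pair is just one of the combinations that can be tried):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # binary target

# liblinear supports both L1 and L2 penalties for binary problems
clf = LogisticRegression(solver="liblinear", penalty="l2", C=1.0)
clf.fit(X, y)
proba = clf.predict_proba(X[:3])  # class probabilities from the logistic function
```

Swapping the solver and penalty arguments is all that is needed to compare the different regularization schemes.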


Chapter 3

Methodology

This chapter details what this work set out to achieve. The methods used are explained, together with any other information needed to help the reader follow the experiments covered later.

The chapter also presents the dataset that was used, closing with a section about other possible datasets and the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in full detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of recent efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients, or inputs (in machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Because such observational data was not properly randomized, the treated and untreated groups need not come from the same probability distribution, and the number of units that received the treatment can differ substantially from the number that did not.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they would be unethical or impossible to perform. For example, an experiment testing whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, is completely unethical for clear reasons.

To overcome these limitations when working on causal effects in observational data, synthetic, semi-synthetic, or toy datasets are created by researchers in order to establish a good starting point and benchmark framework for trying, testing, or developing better algorithms able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to note that there are two different kinds of predictions for causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In such cases a policy risk function $\pi$ is designed to apply or withhold the treatment $t$ depending on a certain threshold $\theta$. Minimizing the errors made when choosing between the active treatment and control is the main goal when iterating over different values of the threshold on the trained dataset.
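Such a thresholded policy can be sketched schematically as follows (the function name and inputs are hypothetical, assuming per-unit effect estimates are already available):

```python
import numpy as np

def policy(ite_estimates, theta=0.0):
    """Apply the treatment (pi = 1) when the estimated effect exceeds theta."""
    return (np.asarray(ite_estimates) > theta).astype(int)

# Units whose estimated benefit exceeds the threshold receive the treatment
decisions = policy([0.5, -0.2, 1.3, 0.0], theta=0.0)
```

Sweeping theta over a range of values and evaluating the resulting decisions is what produces the policy risk curve described above.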

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment conditions to reduce the developmental and health problems of low-birth-weight premature infants. The treated group received home visits and integration into a dedicated child development center in addition to a pediatric follow-up, which can be described as high-quality child care, whereas the control group received only the pediatric follow-up.

(Hill, 2011) presented a semi-synthetic (also called semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned above. In (Hill, 2011), continuous and binary covariates from this real-life RCT were selected; making use of these covariates, the author created non-parametric simulated outcomes for the whole population of the trial. In total, 25 covariates of the whole study were taken for the dataset creation. The author then introduced an artificial imbalance between the control and treatment individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units, or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, which makes learning and predicting the effects of treatment t = 0 or t = 1 a demanding generalization task for any algorithm.

Along with the covariates for each unit, the simulated causal information is provided: the treatment actually applied (t = 0 or t = 1), the observed factual outcome (yF), the counterfactual outcome (yCF), and the average outcomes mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles metrics

It is worth mentioning other published articles evaluating ITE, ATE, and PEHE errors, for the reader to look further into if interested. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b), the IHDP dataset (Hill, 2011) was run on 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). To implement the BART results, (Johansson, Shalit, and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication in the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill, 2011). However, the authors neither state the number of replications used to gather the metrics, nor whether a log-linear "A" or "B" (or any other) method was used to simulate the semi-synthetic dataset. Consequently, the results obtained by them cannot be compared to this dissertation's results and are not shown in this work.


Chapter 4

Experiments

A series of runs with replications of the IHDP dataset were performed to ultimately predict all the factual yF and counterfactual yCF outcomes for every single unit. Subsequently, those values are input into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE, and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should be clear on by this point, neither the counterfactual outcome yCF nor the average outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these values, along with the factual outcome yF, are only used to compute the εITE, εATE, and εPEHE errors.
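The role of these quantities at evaluation time can be sketched as follows (a simplified illustration of the error computations; the thesis itself reuses the implementation from (Louizos et al., 2017), and the function and variable names here are hypothetical):

```python
import numpy as np

def eval_errors(mu0, mu1, y0_hat, y1_hat):
    """Compare predicted per-unit effects against the simulated average outcomes."""
    true_ite = mu1 - mu0                 # ground-truth individual effects
    pred_ite = y1_hat - y0_hat           # model-estimated individual effects
    eps_ate = np.abs(true_ite.mean() - pred_ite.mean())      # ATE error
    eps_pehe = np.sqrt(np.mean((true_ite - pred_ite) ** 2))  # sqrt(PEHE)
    return eps_ate, eps_pehe

mu0 = np.array([1.0, 2.0])
mu1 = np.array([3.0, 2.5])
eps_ate, eps_pehe = eval_errors(mu0, mu1,
                                y0_hat=np.array([1.1, 1.9]),
                                y1_hat=np.array([2.9, 2.7]))
```

The key point is that mu0, mu1, and yCF appear only inside this evaluation step, never in the training data fed to the regressors.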

The experiments were run on 10, 100, and 1000 replications, both within-sample and out-of-sample. The 10, 100, and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definition is the one given by (Shalit, Johansson, and Sontag, 2017), the same convention later followed by (Louizos et al., 2017) to perform, compare, and show their results.

Within-sample: this test refers to all the errors (ITE, ATE, and PEHE) made by the predictions of the already-trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples receiving treatments t = 0 and t = 1), in which only one applied treatment and the factual outcome of that treatment are known for each individual x. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data and were mentioned earlier in this work.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions different even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, in order to later determine the ITE, ATE, and PEHE errors.

Once the model is trained, it predicts an outcome for each one of the inputs (units) using treatment value t = 0, and then again with the treatment set to t = 1. The difference between these two predictions for each input is the ITE and ultimately defines whether the patient would benefit from applying the treatment. Mathematically, it is represented by E[Y1 − Y0 | x].

The machine learning algorithms implemented in Python (User guide: contents - scikit-learn 0.19.2 documentation) were run with their default hyperparameters to obtain the metrics shown in the tables displayed in this chapter. Hyperparameter tuning, where performed, used 100 replications, following the same methodology as the methods compared in the previously mentioned publications.
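The procedure just described can be sketched as follows (a minimal illustration on synthetic data with an arbitrary regressor; the shapes and the simulated outcome are assumptions, not the exact IHDP format):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
n = 500
X = rng.randn(n, 25)                         # covariates (25, as in IHDP)
t = rng.binomial(1, 0.3, size=n)             # observed (factual) treatment
y = X[:, 0] + 2.0 * t + 0.1 * rng.randn(n)   # simulated factual outcome

# Train a single regressor on the covariates plus the treatment indicator
model = Ridge(alpha=1.0)
model.fit(np.column_stack([X, t]), y)

# Predict both potential outcomes for every unit: t = 1 and then t = 0
y1_hat = model.predict(np.column_stack([X, np.ones(n)]))
y0_hat = model.predict(np.column_stack([X, np.zeros(n)]))

ite_hat = y1_hat - y0_hat      # estimated E[Y1 - Y0 | x] per unit
ate_hat = ite_hat.mean()       # estimated average treatment effect
```

Any of the regressors from Chapter 2 can be substituted for Ridge here; only the fit/predict calls matter for the procedure.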

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14   0.94 ± 0.35   2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99   0.97 ± 0.67   4.80 ± 2.80
LassoLars                         4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
Lasso                             4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
ARDRegression                     3.92 ± 2.01   0.97 ± 0.74   4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09   1.54 ± 1.07   4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99   0.89 ± 0.63   4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
LinearRegression                  3.92 ± 2.01   0.89 ± 0.65   4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is worth noting that the split between training and testing was performed randomly over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63   1.24 ± 0.62   2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67   1.82 ± 1.35   4.13 ± 2.23
LassoLars                         4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
Lasso                             4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
ARDRegression                     3.57 ± 1.70   1.83 ± 1.41   4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94   2.38 ± 1.75   4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68   1.76 ± 1.30   4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
LinearRegression                  3.58 ± 1.70   1.77 ± 1.33   4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.
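The regression setup behind all these tables can be sketched as follows: fit a single regressor on the covariates plus the treatment indicator, then predict both potential outcomes for every unit. This is a minimal stand-in using ordinary least squares via numpy; the thesis runs the same pattern with the scikit-learn regressors listed, and the exact feature handling in the thesis code may differ.

```python
import numpy as np

def fit_predict_counterfactuals(X, t, y):
    """Fit one linear model on [covariates, treatment, bias], then
    predict the outcome for every unit under t=0 and t=1."""
    n = len(t)
    Z = np.column_stack([X, t, np.ones(n)])          # observed design
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)        # OLS fit
    Z1 = np.column_stack([X, np.ones(n), np.ones(n)])   # everyone treated
    Z0 = np.column_stack([X, np.zeros(n), np.ones(n)])  # nobody treated
    return Z0 @ w, Z1 @ w                            # (mu0_hat, mu1_hat)

# synthetic sanity check with a known constant treatment effect of 3.0
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
t = rng.integers(0, 2, size=500).astype(float)
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 * t + rng.normal(scale=0.1, size=500)
mu0_hat, mu1_hat = fit_predict_counterfactuals(X, t, y)
ate_hat = (mu1_hat - mu0_hat).mean()   # close to the true effect of 3.0
```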

TABLE 4.3: IHDP, 100 replications - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP, 100 replications - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned earlier in this thesis, hyper-parameter tuning, where applied, was performed on this number of replications.

TABLE 4.5: IHDP, 100 replications, already split dataset - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP, 100 replications, already split dataset - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting "A" generated using the code from (Dorie, 2016), was used for both types of measures.

Four tables are shown for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The improvement was not dramatic, but enough to keep the scaling for the final results of the methods presented in the following section.
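The scaling step mirrors scikit-learn's MinMaxScaler: the column-wise minimum and maximum are computed on the training split only and then applied to both splits. A numpy-only sketch of that behaviour:

```python
import numpy as np

def minmax_fit(X_train):
    """Column-wise min and span computed on the training split only,
    mirroring MinMaxScaler's fit-on-train / transform-both pattern."""
    lo = X_train.min(axis=0)
    hi = X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return lo, span

def minmax_transform(X, lo, span):
    """Map each column into [0, 1] relative to the training range."""
    return (X - lo) / span

# illustrative covariate matrix (3 units, 2 features)
X_train = np.array([[0.0, 10.0],
                    [2.0, 30.0],
                    [4.0, 20.0]])
lo, span = minmax_fit(X_train)
X_scaled = minmax_transform(X_train, lo, span)
# column 0 becomes [0, 0.5, 1]; column 1 becomes [0, 1, 0.5]
```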



TABLE 4.7: IHDP, 1000 replications - No scaling - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP, 1000 replications - No scaling - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP, 1000 replications - Scaled - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP, 1000 replications - Scaled - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Next, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is well below that of the regressors. The main reason is that the continuous target values must be encoded into discrete classes before class probabilities can be assigned, and these classes do not coincide with the values that actually need to be predicted; additionally, precision is lost when decoding the predictions back. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
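The precision loss described here comes from the encode/decode round trip, independently of the classifier itself. A small sketch of that step (the bin count of 20 is illustrative, not the value used in the thesis code):

```python
import numpy as np

def encode_decode(y, n_bins=20):
    """Discretise a continuous outcome into class labels, then decode
    each label back to its bin centre (the lossy step)."""
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    labels = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)
    centres = (edges[:-1] + edges[1:]) / 2
    return labels, centres[labels]

rng = np.random.default_rng(2)
y = rng.normal(size=1000)
labels, y_decoded = encode_decode(y)
# the round-trip error is bounded by half a bin width
round_trip_error = np.abs(y - y_decoded).max()
```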

TABLE 4.11: IHDP, 100 replications, logistic regressions - Within-sample

                                       εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)    7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)        7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP, 100 replications, logistic regressions - Out-of-sample

                                       εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)    5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)        5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were done. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed both within-sample and out-of-sample, but on 100 replications of the dataset; this is the same procedure the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyper-parameter search.
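The gamma sweep can be sketched as a plain validation loop over scikit-learn's SVR (the grid values follow the tables; the data here is synthetic, whereas the thesis tunes on IHDP replications):

```python
import numpy as np
from sklearn.svm import SVR

# synthetic regression problem standing in for one IHDP replication
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
X_tr, X_va = X[:200], X[200:]
y_tr, y_va = y[:200], y[200:]

best = None
for gamma in [0.1, 0.05, 0.01, 0.001]:
    model = SVR(kernel="rbf", C=1e3, gamma=gamma).fit(X_tr, y_tr)
    mse = np.mean((model.predict(X_va) - y_va) ** 2)   # validation error
    if best is None or mse < best[0]:
        best = (mse, gamma)
best_mse, best_gamma = best
```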

TABLE 4.13: IHDP, 100 replications, SVR hyper-parameter tuning - Within-sample

                          εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP, 100 replications, SVR hyper-parameter tuning - Out-of-sample

                          εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39

Finally, the results obtained by this thesis are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results reported in (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

             √εPEHE        εATE
OLS/LR-1     5.8 ± .3      .73 ± .04
OLS/LR-2     2.4 ± .1      .14 ± .01
BLR          5.8 ± .3      .72 ± .04
k-NN         2.1 ± .1      .14 ± .01
TMLE         5.0 ± .2      .30 ± .01
BART         2.1 ± .1      .23 ± .01
RANDFOR      4.2 ± .2      .73 ± .05
CAUSFOR      3.8 ± .2      .18 ± .01
BNN          2.2 ± .1      .37 ± .03
TARNET       .88 ± .0      .26 ± .01
CFR MMD      .73 ± .0      .30 ± .01
CFR WASS     .71 ± .0      .25 ± .01

Within-sample, IHDP, 1000 replications.

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

             √εPEHE        εATE
OLS/LR-1     5.8 ± .3      .94 ± .06
OLS/LR-2     2.5 ± .1      .31 ± .02
BLR          5.8 ± .3      .93 ± .05
k-NN         4.1 ± .2      .79 ± .05
BART         2.3 ± .1      .34 ± .02
RANDFOR      6.6 ± .3      .96 ± .06
CAUSFOR      3.8 ± .2      .40 ± .03
BNN          2.1 ± .1      .42 ± .03
TARNET       .95 ± .0      .28 ± .01
CFR MMD      .78 ± .0      .31 ± .01
CFR WASS     .76 ± .0      .27 ± .01

Out-of-sample, IHDP, 1000 replications.


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications on powerful feature selection methods were discussed in the Related Work section, an experiment was performed in the developed code using Recursive Feature Elimination (RFE) from the scikit-learn library.

Strictly speaking, under the Strong Ignorability assumption on the studied dataset it would not be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, eliminating features might relieve some of their errors.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; they are therefore not shown here, but can be reviewed in the code implementation for further analysis.
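The RFE experiment can be sketched as follows, using scikit-learn's RFE with a linear base estimator (synthetic data here; the thesis run used the IHDP covariates, and the number of features to keep is illustrative):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# synthetic data: only features 0 and 1 carry signal
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.normal(size=200)

# recursively drop the weakest feature until 4 remain
selector = RFE(LinearRegression(), n_features_to_select=4).fit(X, y)
kept = np.flatnonzero(selector.support_)   # indices of retained features
```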

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network (DANN) implementation was tested on just 10 replications of the IHDP dataset. The code from the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for the 10 replications reported in this work; the uploaded code contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated time to finish was about four and a half days on an Intel Dual Core i7.

Due to their architectural design, domain adaptation algorithms are a promising direction for exploring ITE and ATE prediction.

TABLE 4.17: Domain Adaptation Neural Networks

                        εITE           εATE           √εPEHE
DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and out-of-sample, IHDP, 10 replications.

4.3 Discussion

As can be clearly seen, the results of the machine learning regressors applied in this dissertation are very close to those obtained in the publications of the cited authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the unbalanced treatment assignment) nor any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.


The considerable effort invested by those authors appears to lead to complicated methods without gaining much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Network on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate, custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions or special preprocessing steps (apart from scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of the thesis, although the emphasis shifted once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the deep learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal problems from observational data, with more than two possible outcomes, that are substantially better suited to Reinforcement Learning algorithms than to any deep neural network or regressor.

Finally, this work is intended to fill a largely empty space of straightforward definitions for applying machine learning to causality. Although several notable papers have been published in the last two years, they are difficult to follow for researchers with a computer or data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future directions for this work include five different approaches that should be taken.

First, it would be important to apply these machine learning methods to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, extending the functionality developed for a binary treatment and a continuous outcome to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied, would be an important next step in causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the 1000-replication IHDP dataset is a very promising task, due both to the architectural design of the algorithm and to the precision, outperforming the state of the art, that it showed in the experiment run here.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied falls within continuous-space time series problems. Such datasets will possibly be the next focus of researchers working on machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela Van Der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: arXiv:1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf


Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie Vincent (2016) NPCI Non-parametrics for Causal Inference URL https githubcomvdorienpci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer Patrik O et al Nonlinear causal discovery with additive noise models Tech repURL https is tuebingen mpg de fileadmin user _ upload files publicationsNIPS2008-Hoyer-neu_5406[0]pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww


Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos Christos et al (2017) ldquoCausal Effect Inference with Deep Latent-VariableModels arXiv 1705 08821v2 [ stat ML ] 6 Nov 2017rdquo In Nips arXiv arXiv170508821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1


Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Complete Draft Tech rep URLhttpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: arXiv:1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep


2.1.5 Lasso

Lasso Regression is a linear model fitted with an \ell_1 prior as regularizer. Its objective is to minimize:

\min_w \frac{1}{2 n_{samples}} \|Xw - y\|_2^2 + \alpha \|w\|_1

This method minimizes the least-squares loss with the penalty \alpha \|w\|_1 added, where \alpha is a constant and \|w\|_1 is the \ell_1-norm of the parameter vector.

It is important to note that this algorithm yields sparse models, which can be helpful for performing feature selection.
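As a minimal illustration of this sparsity (synthetic data, not from the thesis experiments), a Lasso fit on ten features of which only two carry signal zeroes out the rest:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: only the first two of ten features carry signal.
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)

# alpha is the constant multiplying the l1 term in the objective above.
model = Lasso(alpha=0.5).fit(X, y)

# The l1 prior drives irrelevant coefficients exactly to zero,
# which is why Lasso can double as a feature selector.
n_selected = int(np.sum(model.coef_ != 0))
```

Inspecting `model.coef_` shows the eight noise features with coefficients exactly zero, while the two informative ones survive (slightly shrunk by the penalty).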

2.1.6 Lasso Lars

This model is trained with Least Angle Regression (LARS), with \ell_1 regularization applied.

The objective function is determined by

\frac{1}{2 n_{samples}} \|y - Xw\|_2^2 + \alpha \|w\|_1

2.1.7 ARD Regression

Although this method is similar to Bayesian Ridge, it can lead to sparser weights w.

It drops the assumption of the Gaussian being spherical, allowing it to be elliptical.

Mathematically:

p(w \mid \lambda) = \mathcal{N}(w \mid 0, A^{-1}), \quad \text{with } \mathrm{diag}(A) = \lambda = \{\lambda_1, \ldots, \lambda_p\}
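A small sketch of this behaviour on synthetic data where only one of five weights is relevant (an assumed setup, not from the dissertation's experiments): the per-weight precisions \lambda_i prune the irrelevant weights toward zero.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

# Toy data: only feature 0 matters; ARD should shrink the rest to ~0.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 2.0 * X[:, 0] + 0.05 * rng.randn(100)

# One precision lambda_i per weight (the diagonal of A above),
# so each irrelevant weight can be pruned independently.
ard = ARDRegression().fit(X, y)
```

After fitting, `ard.coef_[0]` is close to 2.0 while the remaining coefficients are near zero.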

2.1.8 Passive Aggressive Regressor

Suitable for large-scale learning, Passive Aggressive algorithms do not require a learning rate, but they do require a regularization parameter C.

They can be used with two different loss functions: PA-I (epsilon-insensitive) or PA-II (also known as squared epsilon-insensitive).

2.1.9 Theil Sen Regressor

It is especially suited to multivariate outliers, but its efficiency decreases dramatically when solving high-dimensionality problems; in high dimensions this method becomes similar to Linear Regression with Ordinary Least Squares.


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted from the n nearest neighbors found during the training phase. Note that n is defined by the user, and it will affect, positively or negatively, the quality of the predictions.
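A toy sketch of this averaging over the n nearest training points (hypothetical data; in scikit-learn the parameter is called n_neighbors):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Ten training points on a line: y = 2x.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = X.ravel() * 2.0

# The prediction for a query is the mean target of its
# n_neighbors closest training points.
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
pred = knn.predict([[5.0]])[0]  # neighbors are x=4,5,6 -> mean(8,10,12)
```

Here the three neighbors of x = 5 are 4, 5, and 6, so the prediction is the mean of their targets, 10.0; a different n_neighbors would change the result.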

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can handle predictions over more than two classes through the multinomial formulation of the log-loss.

The scikit-learn implementation can fit binary, One-vs-Rest, or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets, and the results are discussed in the Experiments chapter.
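A minimal sketch of this usage, with the Iris data as a stand-in (the newton-cg and lbfgs solvers with l2 penalty are the ones used later in the experiments):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Three-class problem; 'newton-cg' (like 'lbfgs') supports the
# l2-penalized multinomial formulation.
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(penalty="l2", solver="newton-cg", max_iter=1000).fit(X, y)
acc = clf.score(X, y)  # training accuracy
```

Swapping `solver="lbfgs"` yields an equivalent model fit with a different optimizer, which is exactly the comparison made in Tables 4.11 and 4.12.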


Chapter 3

Methodology

This chapter details what this work set out to achieve. The methods that were used are explained, together with any other information needed to help the reader follow the flow of the experiments covered later.

It also describes how the dataset that was used is presented, closing with a section about other possible datasets that could be applied and the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in full detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of the recent efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients, or inputs (in machine learning vocabulary) that was collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Since the already collected, observational data has not been randomized properly, it does not come from a single probability distribution; moreover, the number of units that received the treatment and the number that did not can differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions, since they would be unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or pedestrians, against a control treatment of driving without alcohol consumption, would be completely unethical for clear reasons.

To overcome these limitations when working on causal effects in observational data, researchers create synthetic, semi-synthetic, or toy datasets in order to establish a good starting point and benchmark framework in which to try, test, or develop better algorithms that make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to note that there are two different kinds of predictions in causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases a policy risk function π is designed that applies the treatment t, or not, depending on a certain threshold θ. The main goal is then to make as few errors as possible when deciding between the active treatment and control, while iterating over different values of the threshold on the trained dataset.
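The thresholded policy described above can be sketched as follows; the helper name `policy` and the toy effect values are illustrative, not taken from the thesis code:

```python
import numpy as np

def policy(ite_hat, theta=0.0):
    """Treat a unit only when its predicted effect exceeds threshold theta."""
    return (ite_hat > theta).astype(int)

# Predicted individual effects for five hypothetical units.
ite_hat = np.array([-1.2, 0.3, 0.0, 2.5, -0.1])
t_assigned = policy(ite_hat, theta=0.0)
```

Sweeping `theta` over a range of values and counting the wrong treat/control decisions against known outcomes is what the policy risk evaluation iterates over.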

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment aimed at reducing the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and integration at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group received only the pediatric follow-up.

However, (Hill 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross 1993) mentioned in the paragraph above. In (Hill 2011) some continuous and binary covariates from this real-life RCT were selected; using these covariates, the author generated non-parametric simulated outcomes for the whole population of the trial. 25 covariates of the original study were taken for this dataset creation. The author then introduced an artificial imbalance between the control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units, or inputs), of which 608 received no treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of treatment t = 0 or t = 1 through the generalization an algorithm can perform.

Along with the covariates, each unit carries the simulated causal information: the treatment effectively applied (t = 0 or t = 1), the observed factual outcome (yF), the counterfactual outcome (yCF), and the noiseless average outcomes mu0 and mu1.
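This per-unit layout can be sketched with a tiny synthetic stand-in written and read back in the same shape (the field names x, t, yf, ycf, mu0, and mu1 mirror the description above; the file name and values are made up):

```python
import numpy as np

# Tiny synthetic stand-in for one IHDP replication's fields.
n_units, n_cov = 5, 25
rng = np.random.RandomState(0)
np.savez("ihdp_toy.npz",
         x=rng.randn(n_units, n_cov),   # 25 covariates per unit
         t=rng.randint(0, 2, n_units),  # treatment actually applied
         yf=rng.randn(n_units),         # factual (observed) outcome
         ycf=rng.randn(n_units),        # counterfactual outcome
         mu0=rng.randn(n_units),        # noiseless mean outcome under t=0
         mu1=rng.randn(n_units))        # noiseless mean outcome under t=1

data = np.load("ihdp_toy.npz")
true_ite = data["mu1"] - data["mu0"]    # per-unit true treatment effect
```

Only x, t, and yf may be shown to a model during training; ycf, mu0, and mu1 exist solely so the evaluation errors can be computed afterwards.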

In this dissertation, 100 and 1000 replications of the original (Hill 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie 2016). The 100 and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag 2017; Louizos et al. 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analyzing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE, and PEHE errors, for the reader to explore further if desired. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag 2016b) the IHDP dataset (Hill 2011) is run on 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation, since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al. 2017; Shalit, Johansson, and Sontag 2017). For the BART results, (Johansson, Shalit, and Sontag 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) referenced and ran experiments on the IHDP dataset (Hill 2011). However, the authors state neither the number of replications used to gather the metrics nor whether a log-linear "A" or "B" (or any other) method was used to simulate the semi-synthetic dataset. Consequently, the results obtained by them cannot be compared with those of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset was performed to ultimately predict the factual yF and counterfactual yCF outcomes for every single unit. Subsequently, those values are input into the programming code produced by (Louizos et al. 2017), in which the εITE, εATE, and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which amounts to identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome yCF nor the noiseless average outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these three values, along with the factual outcome yF, are used only to compute the εITE, εATE, and εPEHE errors.

The experiments were run on 10, 100, and 1000 replications, both within-sample and out-of-sample. The 10, 100, and 1000 replications were downloaded from (Johansson 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag 2017; Louizos et al. 2017), whose tables of state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions are the ones given by (Shalit, Johansson, and Sontag 2017), the same convention later followed by (Louizos et al. 2017) to perform, compare, and show their results.

Within-sample: refers to all the errors (ITE, ATE, and PEHE) made by the predictions of the already trained model on the training and validation (if any) dataset. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples to which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data and were mentioned in the Background chapter.


Out-of-sample: these predictions are made on completely unseen units, outside the training and validation phases. Making predictions is naturally harder here, since the inputs might come from probability distributions different even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the test dataset, and the ITE, ATE, and PEHE errors are then determined.

Once the model is trained, it predicts for each one of the inputs (units) using treatment value t = 0; predictions are then made again setting the treatment to t = 1. The difference between these two predictions for each input is known as the ITE, and it ultimately defines whether the patient would benefit from applying the treatment. Mathematically it is represented by E[Y1 − Y0 | x].

The machine learning algorithms implemented in Python (User guide: contents — scikit-learn 0.19.2 documentation) were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. Hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
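The two-prediction recipe above can be sketched on synthetic data with a constant true effect of 2.0 (an assumed stand-in for IHDP; the SVR hyperparameters mirror the ones selected later in this chapter):

```python
import numpy as np
from sklearn.svm import SVR

# Fit one regressor on (covariates, treatment), then predict every unit
# under t=1 and t=0; the per-unit difference estimates E[Y1 - Y0 | x].
rng = np.random.RandomState(0)
n, d = 300, 5
X = rng.randn(n, d)
t = rng.randint(0, 2, n)
y = X[:, 0] + 2.0 * t + 0.1 * rng.randn(n)  # true effect is 2.0 everywhere

model = SVR(kernel="rbf", C=1e3, gamma=0.01).fit(np.column_stack([X, t]), y)

pred_t1 = model.predict(np.column_stack([X, np.ones(n)]))
pred_t0 = model.predict(np.column_stack([X, np.zeros(n)]))
ite_hat = pred_t1 - pred_t0        # per-unit estimated treatment effect
ate_hat = ite_hat.mean()           # average treatment effect estimate
```

With the ground-truth effects known (as with IHDP's mu0 and mu1), εPEHE would then be the root mean squared error between `ite_hat` and the true per-unit effects.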

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                    3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                        4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                            4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                    3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor       4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                 5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor             5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                 3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show results for 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is worth remarking that the split between training and testing was performed randomly, over the training dataset only. The intention was to test whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                    3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                        4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                            4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                    3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor       4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                 4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor             4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                 3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                    4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                        4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                            4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                    4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor       5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                 5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor             5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                 4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                    4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                        4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                            4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                    4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor       5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                 4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor             4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                 4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test, obtained from (Johansson 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al. 2017; Shalit, Johansson, and Sontag 2017). As mentioned in this thesis, hyperparameter tuning, where performed, was done on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                    4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                        4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                            4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                    4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor       5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                 5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighboursRegressor             5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                 4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                    4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                        4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                            4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                    4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor       4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                 4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighboursRegressor             4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                 4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag 2017; Louizos et al. 2017). The same semi-synthetic IHDP dataset by (Hill 2011), with the log-linear response setting "A" generated using the code from (Dorie 2016), was used to perform both types of measures.

Four different tables are shown for the 1000 replications. In pairs: two of them (Tables 4.7, 4.8) were obtained without normalization of the input features (covariates); the other couple was obtained by scaling to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results of the methods presented in the following section.
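The [0, 1] scaling step can be sketched as follows (the toy matrices are illustrative); note that the scaler is fit on the training covariates only and then reused on the test covariates:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[0.0, 10.0],
                    [5.0, 20.0],
                    [10.0, 30.0]])
X_test = np.array([[2.5, 15.0]])

# Fit on training data only; apply the same min/max to test data.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

Each training column ends up spanning exactly [0, 1], while test values are mapped with the training columns' minima and ranges.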


TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                    4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                            4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                    4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor       5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                 5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighboursRegressor             5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                 4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                    4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                            4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                    4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor       5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                 4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighboursRegressor             4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                 4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                    4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                            4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                    4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor       5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                 4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighboursRegressor             4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                  4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                   4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                       4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                 4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                    4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                            4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                    4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor       5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                 4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighboursRegressor             4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                  4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                   4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                       4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                 4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Next, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is far below that of the regressors, the main reason being that when the target values are encoded to assign them probabilities, these are not the values that need to be predicted; precision is also lost when decoding the predictions. The l2 norm was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

Method                                  εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)     7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)         7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                                  εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)     5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)         5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyperparameter tuning were done. The observed errors were even smaller, so the final hyperparameters selected for this dataset were a Radial Basis Function (rbf) kernel, C = 1e3, and gamma = 0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag 2017; Johansson, Shalit, and Sontag 2016a; Louizos et al. 2017) state they used


for their own hyperparameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyperparameter selection.

TABLE 4.13: IHDP 100 replications, SVR hyperparameter tuning - Within-sample

Method                   εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05        2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01        2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001       3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001      4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001     4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1        3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1        3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1        3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2     2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1     2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4     2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2    2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyperparameter tuning - Out-of-sample

Method                   εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05        2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01        2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001       3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001      4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001     4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1        2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1        2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1        2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2     2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1     2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4     2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2    3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39
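A sweep of this kind can be sketched with scikit-learn's GridSearchCV; the kernel, C, and gamma values mirror those tried above, while the small regression task is made up for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in task: a smooth 1-feature signal plus noise.
rng = np.random.RandomState(0)
X = rng.randn(120, 4)
y = np.sin(X[:, 0]) + 0.1 * rng.randn(120)

# Candidate hyperparameters, cross-validated with 3 folds.
grid = {"kernel": ["rbf"], "C": [1e3], "gamma": [0.1, 0.05, 0.01, 0.001]}
search = GridSearchCV(SVR(), grid, cv=3).fit(X, y)
best = search.best_params_
```

In the thesis setting the scoring would instead be the causal error metrics on 100 replications, but the mechanics of the sweep are the same.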

Finally, the final results obtained by this thesis and the experiments run are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results obtained in (Shalit, Johansson, and Sontag 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag 2017)

Method       √εPEHE        εATE
OLS/LR-1     5.8 ± .3      .73 ± .04
OLS/LR-2     2.4 ± .1      .14 ± .01
BLR          5.8 ± .3      .72 ± .04
k-NN         2.1 ± .1      .14 ± .01
TMLE         5.0 ± .2      .30 ± .01
BART         2.1 ± .1      .23 ± .01
RANDFOR      4.2 ± .2      .73 ± .05
CAUSFOR      3.8 ± .2      .18 ± .01
BNN          2.2 ± .1      .37 ± .03
TARNET       .88 ± .0      .26 ± .01
CFR MMD      .73 ± .0      .30 ± .01
CFR WASS     .71 ± .0      .25 ± .01

Within-sample, IHDP, 1000 replications.

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag 2017)

Method       √εPEHE        εATE
OLS/LR-1     5.8 ± .3      .94 ± .06
OLS/LR-2     2.5 ± .1      .31 ± .02
BLR          5.8 ± .3      .93 ± .05
k-NN         4.1 ± .2      .79 ± .05
BART         2.3 ± .1      .34 ± .02
RANDFOR      6.6 ± .3      .96 ± .06
CAUSFOR      3.8 ± .2      .40 ± .03
BNN          2.1 ± .1      .42 ± .03
TARNET       .95 ± .0      .28 ± .01
CFR MMD      .78 ± .0      .31 ± .01
CFR WASS     .76 ± .0      .27 ± .01

Out-of-sample, IHDP, 1000 replications.


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though powerful feature selection methods and publications were covered in the Related Work section, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn library.

Assuming Strong Ignorability on the studied dataset, it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, RFE might relieve some of the errors they make.

The results were significantly worse across all algorithms for the causal inference metrics detailed in this work, so they are not shown; they can be inspected by the reader in the code implementation for further analysis.
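A minimal sketch of the RFE mechanism with scikit-learn (synthetic data, not the thesis's actual covariates): the estimator is fit repeatedly and the weakest features are pruned until only the requested number remain.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Toy data: only feature 2 drives the target.
rng = np.random.RandomState(0)
X = rng.randn(150, 8)
y = 4.0 * X[:, 2] + 0.1 * rng.randn(150)

# RFE refits the estimator, dropping the lowest-weight feature each round.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
kept = np.where(selector.support_)[0]  # indices of the surviving features
```

The informative feature survives every elimination round; with real covariates the risk, as noted above, is that a pruned feature was in fact a confounder required for ignorability.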

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code from the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the 10-replication results in this work; the uploaded code contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated finishing time was about four and a half days with an Intel Dual Core i7.

Domain Adaptation algorithms are a promising field for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                  εITE           εATE           √εPEHE
DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and out-of-sample, IHDP, 10 replications.

4.3 Discussion

As can be clearly noticed, the machine learning regressor algorithms applied in this dissertation come very close to the results published by the cited and compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the unbalanced treatment assignment) nor any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.


It seems that an excessive amount of effort from the authors leads to complicated methods that do not gain much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the 10-replication Domain Adaptation Neural Network training and testing errors showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying machine learning regressors out-of-the-box, with significantly less or no added complexity, to predict the factual and counterfactual outcomes for a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state-of-the-art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except scaling the features, which is part of the must-do tasks when using machine learning algorithms) were performed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not applied before to the adopted benchmark IHDP dataset, performing predictions of both the factual and counterfactual outcomes and later presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, although it shifted once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in the last years are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature. Moreover, there are continuous-space causality problems on observational data, involving more than two possible outcomes, that are substantially more suitable to solve with Reinforcement Learning algorithms than with any other Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow when relating terms from the causal inference field for researchers with a computer and data science background. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, extending the functionality developed for a binary treatment and a continuous outcome to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment size would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied or not, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable to test with this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due to both the architectural design of the algorithm and the state-of-the-art precision it showed for the experiment run.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the applied treatments is framed within time-series problems in the continuous space. These kinds of datasets will possibly be the next focus of researchers in machine learning applied to treatments applied over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf.

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822.

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5.

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149.

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf.

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162.

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". In: Biometrics. DOI: 10.1111/j.1541-0420.2005.00377.x.

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation.

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044.

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094.

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf.


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060.

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815.

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci.

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf.

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601.

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames", pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985–1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O. et al. Nonlinear Causal Discovery with Additive Noise Models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803.

Johansson, Fredrik D. (2017 (accessed July 19, 2018)). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www.


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". In: (). URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.nature.com/articles/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373. URL: http://www.ncbi.nlm.nih.gov/pubmed/23025434.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880. URL: https://doi.org/10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for: Estimating Individual Treatment Effect: Generalization Bounds and Algorithms. A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Complete Draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.


2.1.10 K-Neighbors Regressor

In this algorithm the target is predicted by the n nearest neighbours found during the training phase. It is important to notice that n is defined by the user, and it will affect, positively or negatively, the obtained prediction results.
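As a brief illustration of how the user-defined n changes the fit, the scikit-learn implementation can be run on synthetic data (this is a sketch, not the thesis pipeline; the data here is invented):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

# The number of neighbours n controls the bias/variance trade-off:
# n=1 interpolates the training data (including its noise), while a
# large n over-smooths the regression function.
for n in (1, 5, 50):
    knn = KNeighborsRegressor(n_neighbors=n).fit(X, y)
    print(n, round(knn.score(X, y), 3))  # training R^2 shrinks as n grows
```

With n=1 the training R^2 is exactly 1.0, which is precisely the overfitting the paragraph above warns about.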

2.1.11 Logistic Regression

Logistic Regression is mostly used for classification problems, but it can also handle predictions with more than one class using the logistic (log) function.

The scikit-learn implementation can fit binary, one-vs-rest or multinomial logistic regression with optional L2 or L1 regularization.

Several solvers and regularizations were applied to the datasets, and the results will be discussed in the Experiments section.
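A minimal sketch of combining solvers and penalties in scikit-learn follows (synthetic data; the specific pairs shown are illustrative, and note that not every solver supports every penalty, e.g. L1 requires "liblinear" or "saga"):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
# Binary labels generated from a noisy linear rule.
y = (X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(0, 0.5, 300) > 0).astype(int)

# Fit the same data with several (solver, penalty) combinations.
for solver, penalty in [("liblinear", "l1"), ("liblinear", "l2"), ("lbfgs", "l2")]:
    clf = LogisticRegression(solver=solver, penalty=penalty, C=1.0).fit(X, y)
    print(solver, penalty, round(clf.score(X, y), 3))
```

The training accuracies are typically very close across combinations; the main practical difference is which coefficients are shrunk (L2) or zeroed out (L1).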


Chapter 3

Methodology

This chapter details what was attempted to be achieved. In addition, the methods that were used are explained, as well as any other necessary information to help the reader understand the flow of the experiments covered later.

In addition, the dataset that was used is presented, closing with a section about other possible datasets and possible limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in full detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of the recent efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients or inputs (in the machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. By its nature, the already collected observational data has neither been randomized properly, nor does it come from the same probability distribution. Also, the number of units which received the treatment versus the number which did not could potentially differ substantially.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions, since they are unethical or impossible to perform. For example, designing an experiment to test whether driving under the effects of alcohol is dangerous for the driver or the pedestrians, tested against a control treatment which in this case would be driving without alcohol consumption, is completely unethical for clear reasons.

To solve these limitations when working on causal effects on observational data, synthetic, semi-synthetic or toy datasets are created by researchers in order to establish a good starting point and benchmark framework to try, test or develop better algorithms that are able to make more accurate predictions, surpassing the state-of-the-art results.


Lastly, it is important to notice that there are two different kinds of predictions for causal inference. One is the most common to obtain, in which the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases a policy risk function π is designed to apply or not the treatment t depending on a certain threshold θ. Making as few errors as possible when predicting whether to apply the active treatment or the control is the main goal when iterating over different values of the threshold variable for the trained dataset.
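A minimal sketch of such a threshold-based policy, assuming a model that already outputs a predicted individual treatment effect per unit (the function name and arrays here are hypothetical, not from the cited papers):

```python
import numpy as np

def policy(predicted_ite, theta=0.0):
    """Apply the treatment (t=1) only when the predicted effect exceeds theta."""
    return (np.asarray(predicted_ite) > theta).astype(int)

# Sweeping theta trades off how many units get treated.
ite_hat = np.array([-0.4, 0.1, 0.8, 1.5])
print(policy(ite_hat, theta=0.0))  # [0 1 1 1]
print(policy(ite_hat, theta=0.5))  # [0 0 1 1]
```

Iterating over θ and measuring the resulting errors is exactly the tuning loop described above.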

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received visits to their homes and integration at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

However, (Hill, 2011) presented a semi-synthetic (also mentioned in this work and in the field as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the paragraph above. In (Hill, 2011) some continuous and binary covariates from this real-life RCT were selected. Making use of these covariates, (Hill, 2011) created a simulated outcome, generating non-parametric simulated outcomes for the whole population of the trial; 25 covariates of the whole study were taken for this dataset creation. Consequently, the author introduced an artificial imbalance between the control and treatment individuals by removing a subset of the treated population. Finally, the dataset comprises 747 subjects (units or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatment t = 0 or t = 1 based on the generalization that an algorithm can perform.

Along with the covariates for each unit, the simulated causal information can be observed: the effectively applied treatment (t = 0 or t = 1), the observed outcome (yF), the counterfactual outcome (yCF), and the average potential outcomes mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting B in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analysing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles evaluating ITE, ATE and PEHE errors, for the reader to look further into them if intended. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different methods for replication, a different number of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b), the IHDP dataset (Hill, 2011) was run on 100 replication experiments in order to perform hyperparameter tuning, and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. These results are not shown in this dissertation, since the response surface chosen differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). In (Johansson, Shalit, and Sontag, 2016b), to implement the BART results, they relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

In a recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks), the authors referenced and ran experiments on the IHDP dataset (Hill, 2011). However, they do not make explicit the number of replications used to gather the metrics, nor whether a log-linear A or B or any other method was used to simulate the semi-synthetic dataset. Consequently, the results obtained by them cannot be compared to this dissertation's results, and they are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset were performed to ultimately predict the factual yF and counterfactual yCF outcomes for every single unit. Subsequently, those values are input into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. In this work it is of particular interest to correctly predict the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome yCF nor the average treatment outcomes with noise, mu0 and mu1, can be used at all to train the regressor models. Instead, these three values, along with the factual outcome yF, are used only to obtain the εITE, εATE and εPEHE errors.
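Under the usual definitions of these quantities, two of the errors can be computed directly from the predicted outcomes and the simulated means; a sketch with hypothetical arrays (not thesis data, and a simplification of the evaluation code actually used):

```python
import numpy as np

def ate_error(mu0, mu1, y0_hat, y1_hat):
    """|E[tau_hat] - E[tau]|, where the true effect is tau = mu1 - mu0."""
    return np.abs(np.mean(y1_hat - y0_hat) - np.mean(mu1 - mu0))

def sqrt_pehe(mu0, mu1, y0_hat, y1_hat):
    """Root of the Precision in Estimation of Heterogeneous Effect (PEHE)."""
    return np.sqrt(np.mean(((y1_hat - y0_hat) - (mu1 - mu0)) ** 2))

# Hypothetical values for four units.
mu0 = np.array([1.0, 2.0, 0.5, 1.5])
mu1 = np.array([3.0, 2.5, 2.5, 4.5])   # true effects: [2.0, 0.5, 2.0, 3.0]
y0_hat = np.array([1.1, 1.8, 0.4, 1.6])
y1_hat = np.array([2.9, 2.6, 2.2, 4.0])
print(ate_error(mu0, mu1, y0_hat, y1_hat))
print(sqrt_pehe(mu0, mu1, y0_hat, y1_hat))
```

Note that mu0 and mu1 enter only these error computations, never the training of the regressors.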

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The 10, 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), from which the tables with their state-of-the-art errors will also be displayed in this section, so the reader can compare with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definition given by (Shalit, Johansson, and Sontag, 2017) in their publication is adopted, being the same technique later followed by (Louizos et al., 2017) to perform, compare and show their results.

Within-sample: this test refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already trained model against the training and validation (if any) dataset. Note that this is not a trivial task, since the model has been trained with an unbalanced dataset (different numbers of samples in which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for an individual x ∈ X. The other problem to overcome is that, in practice, the population who received treatment t = 1 and the population who received t = 0 might come from completely different probability distributions. All these are common problems of observational data, and they were mentioned in the Background chapter.


Out-of-sample: these predictions are made on completely new units, unseen during the training or validation phase. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0, and subsequently predictions are made setting all the values of the treatment to t = 1. The subtraction between these two predictions for each input is known as the ITE, and it ultimately defines whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by E[Y1 − Y0 | x].

The machine learning algorithms implemented in Python by (User guide: contents — scikit-learn 0.19.2 documentation) were run with the default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. The hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
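The two-prediction procedure just described can be sketched with any scikit-learn regressor by appending the treatment indicator as an extra feature (synthetic data; a simplified stand-in for the thesis pipeline, with an invented outcome model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
n = 500
X = rng.normal(size=(n, 3))
t = rng.binomial(1, 0.3, size=n)        # unbalanced treatment assignment
true_ite = 2.0 + X[:, 0]                # heterogeneous treatment effect
y = X @ np.array([1.0, -1.0, 0.5]) + t * true_ite + rng.normal(0, 0.1, n)

# Train on the factual data only: covariates plus the observed treatment.
model = LinearRegression().fit(np.column_stack([X, t]), y)

# Predict twice per unit, forcing t=0 and then t=1, and subtract.
y0_hat = model.predict(np.column_stack([X, np.zeros(n)]))
y1_hat = model.predict(np.column_stack([X, np.ones(n)]))
ite_hat = y1_hat - y0_hat               # estimate of E[Y1 - Y0 | x]
print(round(ite_hat.mean(), 2))         # close to the true ATE of 2.0
```

A plain linear model without interaction terms can only recover a constant effect; the thesis' stronger regressors can capture heterogeneous effects from the same two-prediction recipe.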

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is remarkable that the split between training and testing was performed randomly, over the training dataset only. The intention was to prove whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                  εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79

22 Chapter 4 Experiments

Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson and Sontag, 2017). As mentioned in this thesis, hyper-parameter tuning, where performed at all, was done on this number of replications.
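For reference, such a pre-split replication file can be read along the following lines. This is a sketch: the `.npz` layout (fields `x`, `t`, `yf`, `ycf`, `mu0`, `mu1`, with one slice per replication) is assumed from Johansson's public IHDP release, and the file name here is a stand-in written by the snippet itself.

```python
import numpy as np

# Create a tiny synthetic stand-in with the assumed IHDP .npz layout:
# x: (units, covariates, replications); t/yf/ycf/mu0/mu1: (units, replications)
rng = np.random.default_rng(0)
n, d, reps = 50, 25, 3
np.savez("ihdp_train_demo.npz",
         x=rng.normal(size=(n, d, reps)),
         t=rng.integers(0, 2, size=(n, reps)),   # treatment assignment
         yf=rng.normal(size=(n, reps)),          # factual outcome
         ycf=rng.normal(size=(n, reps)),         # counterfactual outcome
         mu0=rng.normal(size=(n, reps)),         # noiseless control response
         mu1=rng.normal(size=(n, reps)))         # noiseless treated response

data = np.load("ihdp_train_demo.npz")

def get_replication(data, i):
    """Extract covariates, treatment and outcomes for replication i."""
    return (data["x"][:, :, i], data["t"][:, i],
            data["yf"][:, i], data["ycf"][:, i],
            data["mu0"][:, i], data["mu1"][:, i])

x, t, yf, ycf, mu0, mu1 = get_replication(data, 0)
print(x.shape, t.shape)  # (50, 25) (50,)
```

Each replication is then treated as an independent dataset, and the reported errors are averaged over replications.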

TABLE 4.5 IHDP 100 replications, already split dataset - Within sample

Method | εITE | εATE | √εPEHE
Support Vector Regressor (SVR) | 3.05 ± 0.38 | 0.76 ± 0.08 | 3.17 ± 0.40
BayesianRidge | 4.44 ± 0.58 | 0.80 ± 0.14 | 5.61 ± 0.83
LassoLars | 4.76 ± 0.36 | 4.55 ± 0.17 | 7.88 ± 0.76
Lasso | 4.76 ± 0.36 | 4.55 ± 0.17 | 7.88 ± 0.76
ARDRegression | 4.45 ± 0.59 | 0.77 ± 0.15 | 5.61 ± 0.83
PassiveAggressiveRegressor | 5.03 ± 0.62 | 0.83 ± 0.13 | 5.63 ± 0.82
TheilSenRegressor | 4.40 ± 0.57 | 0.72 ± 0.13 | 5.60 ± 0.82
BaggingRegressor | 5.31 ± 0.48 | 3.41 ± 0.14 | 6.72 ± 0.69
KNeighboursRegressor | 5.31 ± 0.48 | 3.41 ± 0.14 | 6.72 ± 0.69
LinearRegression | 4.48 ± 0.59 | 0.75 ± 0.14 | 5.60 ± 0.82

TABLE 4.6 IHDP 100 replications, already split dataset - Out-of-sample

Method | εITE | εATE | √εPEHE
Support Vector Regressor (SVR) | 2.84 ± 0.28 | 0.73 ± 0.07 | 3.49 ± 0.49
BayesianRidge | 4.41 ± 0.57 | 0.81 ± 0.11 | 5.74 ± 0.89
LassoLars | 4.65 ± 0.34 | 4.31 ± 0.14 | 7.96 ± 0.82
Lasso | 4.65 ± 0.34 | 4.31 ± 0.14 | 7.96 ± 0.82
ARDRegression | 4.42 ± 0.58 | 0.78 ± 0.11 | 5.73 ± 0.89
PassiveAggressiveRegressor | 4.95 ± 0.59 | 1.01 ± 0.17 | 5.78 ± 0.89
TheilSenRegressor | 4.38 ± 0.56 | 0.85 ± 0.13 | 5.74 ± 0.89
BaggingRegressor | 4.95 ± 0.46 | 2.98 ± 0.10 | 6.65 ± 0.75
KNeighboursRegressor | 4.95 ± 0.46 | 2.98 ± 0.10 | 6.65 ± 0.75
LinearRegression | 4.45 ± 0.58 | 0.78 ± 0.11 | 5.73 ± 0.89

With 1000 replications, both the within-sample and the out-of-sample results can be compared with those obtained in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting "A" generated using the code from (Dorie, 2016), was used to perform both types of measures.
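The error measures reported throughout these tables can be computed directly from the true and predicted treatment effects. A minimal sketch follows; the exact estimator per table is not restated in this section, so the functional forms here (absolute error on the average effect, root mean squared error on the individual effects) are the standard ones and should be read as an assumption.

```python
import numpy as np

def eps_ate(tau, tau_hat):
    """Absolute error on the Average Treatment Effect."""
    return np.abs(np.mean(tau) - np.mean(tau_hat))

def sqrt_pehe(tau, tau_hat):
    """Root of the Precision in Estimation of Heterogeneous Effect (PEHE)."""
    return np.sqrt(np.mean((tau - tau_hat) ** 2))

# tau comes from the noiseless responses (mu1 - mu0); eps_ITE is computed
# analogously from the noisy factual/counterfactual outcomes instead.
tau = np.array([1.0, 2.0, 3.0])
tau_hat = np.array([1.0, 2.0, 5.0])
print(eps_ate(tau, tau_hat))    # ~0.667
print(sqrt_pehe(tau, tau_hat))  # ~1.155
```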

Four different tables are presented for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The improvement from scaling was not dramatic, but it was enough to keep it for the final results of the methods presented in the following section.



TABLE 4.7 IHDP 1000 replications - No scaling - Within sample

Method | εITE | εATE | √εPEHE
Support Vector Regressor (SVR) | 3.09 ± 0.12 | 0.69 ± 0.02 | 3.21 ± 0.13
BayesianRidge | 4.59 ± 0.19 | 0.78 ± 0.04 | 5.81 ± 0.26
LassoLars | 4.65 ± 0.10 | 4.40 ± 0.04 | 7.91 ± 0.24
Lasso | 4.65 ± 0.10 | 4.40 ± 0.04 | 7.91 ± 0.24
ARDRegression | 4.59 ± 0.19 | 0.76 ± 0.04 | 5.80 ± 0.26
PassiveAggressiveRegressor | 5.41 ± 0.22 | 0.90 ± 0.05 | 5.85 ± 0.26
TheilSenRegressor | 4.55 ± 0.18 | 0.70 ± 0.03 | 5.79 ± 0.26
BaggingRegressor | 5.31 ± 0.15 | 3.28 ± 0.04 | 6.76 ± 0.21
KNeighboursRegressor | 5.31 ± 0.15 | 3.28 ± 0.04 | 6.76 ± 0.21
LinearRegression | 4.63 ± 0.19 | 0.73 ± 0.04 | 5.80 ± 0.26

TABLE 4.8 IHDP 1000 replications - No scaling - Out-of-sample

Method | εITE | εATE | √εPEHE
Support Vector Regressor (SVR) | 2.81 ± 0.09 | 0.78 ± 0.03 | 3.37 ± 0.14
BayesianRidge | 4.57 ± 0.19 | 0.98 ± 0.05 | 5.79 ± 0.26
LassoLars | 4.66 ± 0.11 | 4.41 ± 0.05 | 7.90 ± 0.24
Lasso | 4.66 ± 0.11 | 4.41 ± 0.05 | 7.90 ± 0.24
ARDRegression | 4.58 ± 0.19 | 0.96 ± 0.05 | 5.78 ± 0.26
PassiveAggressiveRegressor | 5.42 ± 0.22 | 1.13 ± 0.07 | 5.83 ± 0.27
TheilSenRegressor | 4.54 ± 0.19 | 0.95 ± 0.05 | 5.78 ± 0.26
BaggingRegressor | 4.95 ± 0.14 | 3.09 ± 0.05 | 6.54 ± 0.22
KNeighboursRegressor | 4.95 ± 0.14 | 3.09 ± 0.05 | 6.54 ± 0.22
LinearRegression | 4.61 ± 0.19 | 0.94 ± 0.05 | 5.78 ± 0.26

TABLE 4.9 IHDP 1000 replications - Scaled - Within sample

Method | εITE | εATE | √εPEHE
Support Vector Regressor (SVR) | 2.38 ± 0.08 | 0.33 ± 0.02 | 2.77 ± 0.12
BayesianRidge | 4.58 ± 0.19 | 0.73 ± 0.04 | 5.80 ± 0.26
LassoLars | 4.65 ± 0.10 | 4.40 ± 0.04 | 7.91 ± 0.24
Lasso | 4.65 ± 0.10 | 4.40 ± 0.04 | 7.91 ± 0.24
ARDRegression | 4.59 ± 0.19 | 0.76 ± 0.04 | 5.80 ± 0.26
PassiveAggressiveRegressor | 5.47 ± 0.22 | 1.02 ± 0.06 | 5.88 ± 0.26
TheilSenRegressor | 4.68 ± 0.19 | 0.69 ± 0.03 | 5.79 ± 0.26
BaggingRegressor | 4.77 ± 0.13 | 2.67 ± 0.03 | 6.37 ± 0.21
KNeighboursRegressor | 4.77 ± 0.13 | 2.67 ± 0.03 | 6.37 ± 0.21
RANSACRegressor | 4.93 ± 0.20 | 1.64 ± 0.09 | 6.09 ± 0.26
HuberRegressor | 4.44 ± 0.18 | 0.67 ± 0.03 | 5.79 ± 0.25
ElasticNet | 4.65 ± 0.10 | 4.40 ± 0.04 | 7.91 ± 0.24
LinearRegression | 4.63 ± 0.19 | 0.73 ± 0.04 | 5.80 ± 0.26


TABLE 4.10 IHDP 1000 replications - Scaled - Out-of-sample

Method | εITE | εATE | √εPEHE
Support Vector Regressor (SVR) | 2.44 ± 0.08 | 0.45 ± 0.03 | 2.81 ± 0.13
BayesianRidge | 4.55 ± 0.19 | 0.95 ± 0.05 | 5.78 ± 0.26
LassoLars | 4.66 ± 0.11 | 4.41 ± 0.05 | 7.90 ± 0.24
Lasso | 4.66 ± 0.11 | 4.41 ± 0.05 | 7.90 ± 0.24
ARDRegression | 4.58 ± 0.19 | 0.96 ± 0.05 | 5.78 ± 0.26
PassiveAggressiveRegressor | 5.44 ± 0.22 | 1.18 ± 0.07 | 5.87 ± 0.26
TheilSenRegressor | 4.68 ± 0.19 | 0.95 ± 0.05 | 5.78 ± 0.26
BaggingRegressor | 4.46 ± 0.13 | 2.33 ± 0.04 | 6.12 ± 0.22
KNeighboursRegressor | 4.46 ± 0.13 | 2.33 ± 0.04 | 6.12 ± 0.22
RANSACRegressor | 4.91 ± 0.20 | 1.73 ± 0.09 | 6.06 ± 0.27
HuberRegressor | 4.44 ± 0.18 | 0.92 ± 0.05 | 5.77 ± 0.26
ElasticNet | 4.66 ± 0.11 | 4.41 ± 0.05 | 7.90 ± 0.24
LinearRegression | 4.61 ± 0.19 | 0.94 ± 0.05 | 5.78 ± 0.26

Consequently, Logistic Regression with a multinomial multi-class predictor has also been applied. Its performance is far below the regressors', the main reason being that when the continuous target values are encoded into classes in order to assign them a probability, those classes do not coincide with the values that actually need to be predicted; further precision is lost when decoding the predictions. The l2 norm has been used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
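The encode/decode scheme described above can be sketched as follows. The number of bins and the use of equal-width binning are assumptions for illustration; lbfgs (like newton-cg) fits the multinomial objective with an l2 penalty by default in scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
x = rng.normal(size=(200, 10))
y = x[:, 0] * 2.0 + rng.normal(scale=0.1, size=200)  # continuous outcome

# Encode the continuous target into discrete classes (20 equal-width bins).
edges = np.linspace(y.min(), y.max(), 21)
classes = np.clip(np.digitize(y, edges) - 1, 0, 19)
centres = (edges[:-1] + edges[1:]) / 2  # value decoded for each class

clf = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000)
clf.fit(x, classes)

y_hat = centres[clf.predict(x)]  # decode class predictions -> continuous values
```

Even when the classifier picks the right bin, the decoded value is only the bin centre, which illustrates the precision loss mentioned above.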

TABLE 4.11 IHDP 100 replications, logistic regressions - Within sample

Method | εITE | εATE | √εPEHE
LogisticRegression - L2 (newton-cg) | 7.77 ± 0.76 | 4.40 ± 0.17 | 7.77 ± 0.76
LogisticRegression - L2 (lbfgs) | 7.77 ± 0.76 | 4.40 ± 0.17 | 7.77 ± 0.76

TABLE 4.12 IHDP 100 replications, logistic regressions - Out-of-sample

Method | εITE | εATE | √εPEHE
LogisticRegression - L2 (newton-cg) | 5.90 ± 0.57 | 2.41 ± 0.11 | 7.21 ± 0.85
LogisticRegression - L2 (lbfgs) | 5.90 ± 0.57 | 2.41 ± 0.11 | 7.21 ± 0.85


From all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were done. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel with C=1e3 and gamma=0.01. The selection was performed within sample and out-of-sample, but on 100 replications of the dataset; this is the same procedure the authors (Shalit, Johansson and Sontag, 2017; Johansson, Shalit and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyper-parameter selection.

TABLE 4.13 IHDP 100 replications, SVR hyper-parameter tuning - Within sample

Method | εITE | εATE | √εPEHE
SVR-rbf-1e3-g0.1 | 3.17 ± 0.40 | 0.82 ± 0.09 | 3.30 ± 0.42
SVR-rbf-1e3-g0.05 | 2.71 ± 0.34 | 0.45 ± 0.07 | 2.78 ± 0.36
SVR-rbf-1e3-g0.01 | 2.35 ± 0.29 | 0.24 ± 0.03 | 2.32 ± 0.31
SVR-rbf-1e3-g0.001 | 3.65 ± 0.45 | 0.52 ± 0.09 | 4.51 ± 0.65
SVR-rbf-1e3-g0.0001 | 4.28 ± 0.55 | 0.76 ± 0.11 | 5.61 ± 0.82
SVR-rbf-1e3-g0.00001 | 4.25 ± 0.52 | 1.49 ± 0.10 | 5.97 ± 0.81
SVR-rbf-1e10-g0.1 | 3.17 ± 0.40 | 0.82 ± 0.09 | 3.30 ± 0.42
SVR-rbf-1e20-g0.1 | 3.17 ± 0.40 | 0.82 ± 0.09 | 3.30 ± 0.42
SVR-rbf-1e30-g0.1 | 3.17 ± 0.40 | 0.82 ± 0.09 | 3.30 ± 0.42
SVR-poly-1e3-degree2 | 2.50 ± 0.29 | 0.28 ± 0.03 | 2.53 ± 0.30
SVR-poly-1e3-degree1 | 2.50 ± 0.29 | 0.28 ± 0.03 | 2.53 ± 0.30
SVR-poly-1e3-degree4 | 2.50 ± 0.29 | 0.28 ± 0.03 | 2.53 ± 0.30
SVR-poly-1e10-degree2 | 2.99 ± 0.34 | 0.41 ± 0.06 | 3.00 ± 0.39

TABLE 4.14 IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

Method | εITE | εATE | √εPEHE
SVR-rbf-1e3-g0.1 | 2.79 ± 0.27 | 0.86 ± 0.13 | 3.25 ± 0.42
SVR-rbf-1e3-g0.05 | 2.66 ± 0.24 | 0.53 ± 0.10 | 2.71 ± 0.35
SVR-rbf-1e3-g0.01 | 2.50 ± 0.23 | 0.31 ± 0.05 | 2.26 ± 0.31
SVR-rbf-1e3-g0.001 | 3.45 ± 0.40 | 0.77 ± 0.16 | 4.23 ± 0.62
SVR-rbf-1e3-g0.0001 | 4.09 ± 0.50 | 0.96 ± 0.21 | 5.31 ± 0.77
SVR-rbf-1e3-g0.00001 | 4.05 ± 0.47 | 1.59 ± 0.18 | 5.65 ± 0.75
SVR-rbf-1e10-g0.1 | 2.79 ± 0.27 | 0.86 ± 0.13 | 3.25 ± 0.42
SVR-rbf-1e20-g0.1 | 2.79 ± 0.27 | 0.86 ± 0.13 | 3.25 ± 0.42
SVR-rbf-1e30-g0.1 | 2.79 ± 0.27 | 0.86 ± 0.13 | 3.25 ± 0.42
SVR-poly-1e3-degree2 | 2.87 ± 0.22 | 0.38 ± 0.05 | 2.48 ± 0.29
SVR-poly-1e3-degree1 | 2.87 ± 0.22 | 0.38 ± 0.05 | 2.48 ± 0.29
SVR-poly-1e3-degree4 | 2.87 ± 0.22 | 0.38 ± 0.05 | 2.48 ± 0.29
SVR-poly-1e10-degree2 | 3.21 ± 0.33 | 0.48 ± 0.06 | 2.95 ± 0.39
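The tuning loop behind these tables can be sketched as follows. Whether the thesis fits a single SVR on the covariates plus the treatment indicator, or one model per arm, is not restated here, so the single-model setup below (toggling the treatment column to get both potential outcomes) is illustrative, as is the toy dataset with a known effect of 2.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
n, d = 150, 10
x = rng.normal(size=(n, d))
t = rng.integers(0, 2, size=n)
y = x[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)  # true effect is 2

xt = np.column_stack([x, t])  # treatment appended as an extra input feature

best = None
for gamma in [0.1, 0.05, 0.01, 0.001]:
    model = SVR(kernel="rbf", C=1e3, gamma=gamma).fit(xt, y)
    # Predict both potential outcomes by toggling the treatment column.
    y1 = model.predict(np.column_stack([x, np.ones(n)]))
    y0 = model.predict(np.column_stack([x, np.zeros(n)]))
    ate_err = abs((y1 - y0).mean() - 2.0)  # error vs. the known ATE of 2
    if best is None or ate_err < best[1]:
        best = (gamma, ate_err)

print("best gamma:", best[0])
```

On the real IHDP data the selection was of course done against the εATE and εPEHE estimates rather than a known ground-truth effect.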

Finally, the final results obtained by this thesis are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results reported in (Shalit, Johansson and Sontag, 2017).


TABLE 4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)

Method | √εPEHE | εATE
OLS/LR-1 | 5.8 ± .3 | .73 ± .04
OLS/LR-2 | 2.4 ± .1 | .14 ± .01
BLR | 5.8 ± .3 | .72 ± .04
k-NN | 2.1 ± .1 | .14 ± .01
TMLE | 5.0 ± .2 | .30 ± .01
BART | 2.1 ± .1 | .23 ± .01
RANDFOR | 4.2 ± .2 | .73 ± .05
CAUSFOR | 3.8 ± .2 | .18 ± .01
BNN | 2.2 ± .1 | .37 ± .03
TARNET | .88 ± .0 | .26 ± .01
CFR MMD | .73 ± .0 | .30 ± .01
CFR WASS | .71 ± .0 | .25 ± .01

Within sample, IHDP 1000 replications

TABLE 4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)

Method | √εPEHE | εATE
OLS/LR-1 | 5.8 ± .3 | .94 ± .06
OLS/LR-2 | 2.5 ± .1 | .31 ± .02
BLR | 5.8 ± .3 | .93 ± .05
k-NN | 4.1 ± .2 | .79 ± .05
BART | 2.3 ± .1 | .34 ± .02
RANDFOR | 6.6 ± .3 | .96 ± .06
CAUSFOR | 3.8 ± .2 | .40 ± .03
BNN | 2.1 ± .1 | .42 ± .03
TARNET | .95 ± .0 | .28 ± .01
CFR MMD | .78 ± .0 | .31 ± .01
CFR WASS | .76 ± .0 | .27 ± .01

Out-of-sample, IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section presented publications with powerful feature selection methods, an experiment was also performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn library.

Strictly speaking, under the Strong Ignorability assumption made for the studied dataset, it would not be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, removing features might relieve some of the errors they make.

The results were significantly worse across all algorithms for the causal inference metrics detailed in this work; thus they are not shown here, but they can be reproduced by the reader from the code implementation for further analysis.
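The experiment can be reproduced along these lines; the choice of base estimator and the number of features to keep are arbitrary here, not the thesis's exact configuration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
x = rng.normal(size=(100, 25))
y = x[:, 0] * 3.0 + x[:, 1] * 2.0 + rng.normal(scale=0.1, size=100)

# Recursively drop the weakest features (by coefficient) until 5 remain.
selector = RFE(LinearRegression(), n_features_to_select=5).fit(x, y)
x_reduced = selector.transform(x)

print(x_reduced.shape)  # (100, 5)
```

The reduced covariate matrix is then fed to the same regressors as before, and the causal error metrics are recomputed.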

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for the 10 replications; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated finishing time was about four and a half days on an Intel Dual Core i7.

Domain Adaptation algorithms are a promising direction for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17 Domain Adaptation Neural Networks

Method | εITE | εATE | √εPEHE
DANN (Within-sample) | 1.18 ± 0.17 | 0.12 ± 0.04 | 1.02 ± 0.48
DANN (Out-of-sample) | 1.20 ± 0.11 | 0.17 ± 0.08 | 0.76 ± 0.23

Within-sample and Out-of-sample, IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the machine learning regressor algorithms applied in this dissertation achieve results very close to those published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treatment assignment, nor any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.


It seems that an excessive amount of effort by the authors leads to complicated methods without gaining much more in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the Domain Adaptation Neural Network training and testing errors on the 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, although it shifted once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in the last few years are potentially more suitable than both plain regressors and custom or generalized metric and error functions; Domain Adaptation Neural Networks are one example, as are other methods from the Deep Learning literature. Moreover, there are continuous-space causal inference problems from observational data, with more than two possible outcomes, that are substantially more suitable to solve with Reinforcement Learning algorithms than with any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers have been published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous output could be extended to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold deciding whether a treatment should be applied or not, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable for this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due to both the architectural design of the algorithm and the precision, outperforming the state of the art, that it showed for the experiment run.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied is framed within time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers working on machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803

Johansson, Fredrik D. (2017 (accessed July 19, 2018)). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.ncbi.nlm.nih.gov/pubmed/20354511

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373. URL: http://www.ncbi.nlm.nih.gov/pubmed/23025434

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880. URL: https://doi.org/10.1198/016214504000001880

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Complete Draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018)

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.



Chapter 3

Methodology

This chapter details what this work set out to achieve. In addition, the methods that were used are explained, along with any other information needed to help the reader follow the experiments covered later.

It also presents the dataset that was used, closing with a section about other possible datasets and the limitations of the ones used in this dissertation.

Finally, the dataset used to perform the experiments is covered in full detail.

3.1 Dataset

Datasets for testing causal inference on observational data extracted from real-life scenarios are difficult to obtain.

On the one hand, the whole point of the current project, and to some extent of the recent efforts in machine learning applied to causality, is to make accurate predictions on a set of units, patients or inputs (in machine learning vocabulary) that were collected without the chance of previously designing a carefully planned Randomized Controlled Trial. Since the already collected, observational data has not been randomized properly, it does not come from the same probability distribution; moreover, the number of units which received the treatment can differ substantially from the number which did not.

On the other hand, some experiments cannot be designed and executed under Randomized Controlled Trial conditions because they would be unethical or impossible to perform. For example, an experiment testing whether driving under the effects of alcohol is dangerous for the driver or for pedestrians, against a control treatment of driving without alcohol consumption, would clearly be unethical to carry out.

To overcome these limitations when working on causal effects in observational data, researchers create synthetic, semi-synthetic or toy datasets in order to establish a good starting point and a benchmark framework on which to try, test or develop better algorithms that can make more accurate predictions and surpass the state-of-the-art results.


Lastly, it is important to note that there are two different kinds of predictions in causal inference. The most common one arises when the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In such cases a policy risk function π is designed to apply (or not) the treatment t depending on a certain threshold θ. Minimising the errors made when predicting whether to apply the active treatment or the control is the main goal when iterating over different values of this threshold on the trained dataset.
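As a concrete illustration, the thresholded policy and its risk can be sketched as follows. The function name, the plug-in estimator over factual outcomes and the toy numbers are illustrative assumptions of this sketch, not the exact estimator used in the cited papers.

```python
import numpy as np

def policy_risk(ite_pred, y_factual, t, threshold=0.0):
    """Plug-in policy risk: 1 minus the expected factual outcome under the
    policy pi(x) = 1 if predicted ITE > threshold else 0. Only factual
    outcomes that agree with the policy's choice are usable."""
    pi = (ite_pred > threshold).astype(int)
    treat_match = (pi == 1) & (t == 1)    # policy says treat, unit was treated
    control_match = (pi == 0) & (t == 0)  # policy says control, unit was control
    p_treat = pi.mean()                   # fraction the policy would treat
    value = 0.0
    if treat_match.any():
        value += y_factual[treat_match].mean() * p_treat
    if control_match.any():
        value += y_factual[control_match].mean() * (1 - p_treat)
    return 1.0 - value

# toy example with made-up numbers
ite_pred = np.array([0.9, -0.2, 0.4, -0.8])
y = np.array([1.0, 1.0, 0.0, 1.0])
t = np.array([1, 0, 1, 0])
print(policy_risk(ite_pred, y, t, threshold=0.0))  # → 0.25
```

Sweeping `threshold` over a grid of θ values and keeping the one with the lowest risk is the iteration described above.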

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment interventions aimed at reducing the developmental and health problems of low-birth-weight premature infants. The treated group received home visits and attendance at a dedicated child development center in addition to a pediatric follow-up, which can be described as high-quality child care; the control group received only the pediatric follow-up.

Later, (Hill 2011) presented a semi-synthetic (also described in this work and in the field as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the above paragraph. Hill selected continuous and binary covariates from this real-life RCT and used them to generate non-parametric simulated outcomes for the whole population of the trial; 25 covariates of the whole study were kept for the dataset creation. The author then introduced an artificial imbalance between control and treated individuals by removing a subset of the treated population. The final dataset comprises 747 subjects (units or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can clearly be noticed, the dataset ends up quite unbalanced, especially for learning and predicting the effects of the treatment t = 0 or t = 1 based on the generalization an algorithm can perform.

Along with the covariates, the simulated causal information can be observed for each unit: the treatment effectively applied (t = 0 or t = 1), the observed factual outcome (yF), the counterfactual outcome (yCF), and the average outcomes with noise, mu0 and mu1.

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting B in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017), the state-of-the-art baselines chosen for comparison in the experiments of the present work.
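A loading sketch for such replication files is shown below. The array layout and key names (x, t, yf, ycf, mu0, mu1, with the replication index on the last axis) are assumptions based on the released files; an in-memory stand-in is generated here instead of the real download so the sketch is self-contained.

```python
import io
import numpy as np

# Build an in-memory stand-in with the assumed layout of the replication
# files: x of shape (units, covariates, replications) and t, yf, ycf,
# mu0, mu1 of shape (units, replications).
rng = np.random.RandomState(0)
units, n_cov, reps = 747, 25, 3
buf = io.BytesIO()
np.savez(buf,
         x=rng.randn(units, n_cov, reps),
         t=rng.randint(0, 2, size=(units, reps)),
         yf=rng.randn(units, reps),
         ycf=rng.randn(units, reps),
         mu0=rng.randn(units, reps),
         mu1=rng.randn(units, reps))
buf.seek(0)

data = np.load(buf)
rep = 0  # slice out one replication, i.e. one simulated dataset
x, t, yf = data["x"][:, :, rep], data["t"][:, rep], data["yf"][:, rep]
print(x.shape, t.shape, yf.shape)  # → (747, 25) (747,) (747,)
```

With the real files, `np.load` would be pointed at the downloaded `.npz` path instead of the buffer.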

This dataset is nowadays a strong benchmark framework for analysing the prediction results of any new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

It is worth mentioning other published articles that evaluate ITE, ATE and PEHE errors, for the reader to look further into if desired. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit and Sontag, 2016b) the IHDP dataset (Hill, 2011) was run on 100 replications for hyperparameter tuning and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. These results are not shown in this dissertation, since the chosen response surface differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson and Sontag, 2017). To produce the BART results, (Johansson, Shalit and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran its experiments on the IHDP dataset (Hill, 2011). However, the authors do not state the number of replications used to gather the metrics, nor whether log-linear setting A, B, or any other method was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with the results of this dissertation and are not shown in this work.

19

Chapter 4

Experiments

A series of runs over replications of the IHDP dataset was performed to ultimately predict the factual yF and counterfactual yCF outcomes for every single unit. Those values were then fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action, or treatment, for a given unit x with its unique covariates (features).

This is a challenging goal since, as should be clear to the reader by this point, neither the counterfactual outcome yCF nor the noisy average outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these three values, along with the factual outcome yF, are only used to compute the εITE, εATE and εPEHE errors.
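Under these definitions, the evaluation metrics can be sketched as follows. The exact εITE convention in the code of (Louizos et al., 2017) may differ; the mean-absolute version below is an assumption of this sketch, while εATE and the square root of PEHE follow their standard definitions.

```python
import numpy as np

def causal_errors(ite_pred, mu0, mu1):
    """Evaluation-only metrics: the true effect mu1 - mu0 is never used
    for training, only to score predictions afterwards."""
    ite_true = mu1 - mu0
    eps_ite = np.mean(np.abs(ite_pred - ite_true))           # assumed eps_ITE convention
    eps_ate = np.abs(np.mean(ite_pred) - np.mean(ite_true))  # error on the average effect
    eps_pehe = np.sqrt(np.mean((ite_pred - ite_true) ** 2))  # sqrt of PEHE
    return eps_ite, eps_ate, eps_pehe

# toy check: predictions [1, 2] against a constant true effect of 1
errs = causal_errors(np.array([1.0, 2.0]), np.zeros(2), np.ones(2))
print([float(round(e, 3)) for e in errs])  # → [0.5, 0.5, 0.707]
```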

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same ones used to produce the results in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017), whose tables of state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions used are those given by (Shalit, Johansson and Sontag, 2017) in their publication, the same convention later followed by (Louizos et al., 2017) to perform, compare and show their results.

Within-sample: this refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already-trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples for which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for each individual x. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data, and they were mentioned earlier.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for every single unit (input) of the testing dataset, and the ITE, ATE and PEHE errors are then computed.

Once the model is trained, it predicts the outcome for each one of the inputs (units) using the treatment value t = 0; predictions are then made again setting the treatment of all units to t = 1. The subtraction between these two predictions for each input is known as the ITE, and ultimately defines whether the patient would benefit or not from applying the treatment. Mathematically, it is represented by E[Y1 − Y0 | x].

The machine learning algorithms implemented in Python (User guide: contents — scikit-learn 0.19.2 documentation) were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. Hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
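The procedure just described — fit one regressor on [covariates, treatment], then predict every unit under both t = 0 and t = 1 — can be sketched with a toy stand-in for IHDP. The simulated +2.0 treatment effect is an assumption of the example, and LinearRegression stands in for any of the scikit-learn regressors in the tables below.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)
n, d = 200, 5
x = rng.randn(n, d)
t = rng.randint(0, 2, size=n)
y_f = x[:, 0] + 2.0 * t + 0.1 * rng.randn(n)  # factual outcome; +2.0 effect by construction

# one regressor over [covariates, treatment]
model = LinearRegression().fit(np.column_stack([x, t]), y_f)

# predict every unit under t = 0 and again under t = 1
y0 = model.predict(np.column_stack([x, np.zeros(n)]))
y1 = model.predict(np.column_stack([x, np.ones(n)]))
ite = y1 - y0  # per-unit estimate of E[Y1 - Y0 | x]
print(round(float(ite.mean()), 1))  # → 2.0, recovering the simulated effect
```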

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14   0.94 ± 0.35   2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99   0.97 ± 0.67   4.80 ± 2.80
LassoLars                         4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
Lasso                             4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
ARDRegression                     3.92 ± 2.01   0.97 ± 0.74   4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09   1.54 ± 1.07   4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99   0.89 ± 0.63   4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
LinearRegression                  3.92 ± 2.01   0.89 ± 0.65   4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show results for 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is worth noting that the split between training and testing was performed randomly, over the training dataset only. The intention was to check whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63   1.24 ± 0.62   2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67   1.82 ± 1.35   4.13 ± 2.23
LassoLars                         4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
Lasso                             4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
ARDRegression                     3.57 ± 1.70   1.83 ± 1.41   4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94   2.38 ± 1.75   4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68   1.76 ± 1.30   4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
LinearRegression                  3.58 ± 1.70   1.77 ± 1.33   4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60   0.86 ± 0.16   5.65 ± 0.83
LassoLars                         4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
Lasso                             4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
ARDRegression                     4.49 ± 0.60   0.81 ± 0.16   5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75   0.83 ± 0.14   5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59   0.79 ± 0.15   5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
LinearRegression                  4.53 ± 0.60   0.79 ± 0.16   5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54   1.02 ± 0.26   5.37 ± 0.78
LassoLars                         4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
Lasso                             4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
ARDRegression                     4.27 ± 0.54   1.00 ± 0.26   5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69   1.00 ± 0.21   5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53   0.99 ± 0.25   5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
LinearRegression                  4.31 ± 0.55   0.99 ± 0.26   5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the dataset already split into training and test obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson and Sontag, 2017). As mentioned earlier in this thesis, hyperparameter tuning, when performed, was done on this number of replications.

TABLE 4.5: IHDP 100 replications, already-split dataset - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38   0.76 ± 0.08   3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58   0.80 ± 0.14   5.61 ± 0.83
LassoLars                         4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
Lasso                             4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
ARDRegression                     4.45 ± 0.59   0.77 ± 0.15   5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62   0.83 ± 0.13   5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57   0.72 ± 0.13   5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
LinearRegression                  4.48 ± 0.59   0.75 ± 0.14   5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already-split dataset - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28   0.73 ± 0.07   3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57   0.81 ± 0.11   5.74 ± 0.89
LassoLars                         4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
Lasso                             4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
ARDRegression                     4.42 ± 0.58   0.78 ± 0.11   5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59   1.01 ± 0.17   5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56   0.85 ± 0.13   5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
LinearRegression                  4.45 ± 0.58   0.78 ± 0.11   5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting A generated using the code from (Dorie, 2016), was used for both types of measure.

Four different tables are shown for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling in the final results of the methods presented in the following section.
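The scaling step corresponds to the following sketch; fitting the scaler on the training split only (so the test split reuses the training min/max) is this sketch's assumption about how leakage was avoided, since the thesis does not state it explicitly.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
x_test = np.array([[2.5, 30.0]])

scaler = MinMaxScaler()                  # maps each training feature to [0, 1]
x_train_s = scaler.fit_transform(x_train)
x_test_s = scaler.transform(x_test)      # test data reuses the training min/max
print(x_train_s.min(), x_train_s.max())  # → 0.0 1.0
print(x_test_s)                          # values 0.75 and 0.666... per feature
```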



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12   0.69 ± 0.02   3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19   0.78 ± 0.04   5.81 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22   0.90 ± 0.05   5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18   0.70 ± 0.03   5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09   0.78 ± 0.03   3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19   0.98 ± 0.05   5.79 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22   1.13 ± 0.07   5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08   0.33 ± 0.02   2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19   0.73 ± 0.04   5.80 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22   1.02 ± 0.06   5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.69 ± 0.03   5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20   1.64 ± 0.09   6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18   0.67 ± 0.03   5.79 ± 0.25
ElasticNet                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is far below that of the regressors, the main reason being that when the continuous target values are encoded into classes so that probabilities can be assigned, the encoded classes are not the same values that need to be predicted; moreover, precision is lost when decoding the predictions back. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

                                      εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)   7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)       7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                      εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)   5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)       5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyperparameter tuning were done. The observed errors were even smaller, so the final hyperparameters selected for this dataset were: Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson and Sontag, 2017; Johansson, Shalit and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyperparameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyperparameter selection runs, including the final configuration.

TABLE 4.13: IHDP 100 replications, SVR hyperparameter tuning - Within-sample

                          εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyperparameter tuning - Out-of-sample

                          εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39
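A minimal version of the gamma sweep looks like the following; synthetic data with a known simulated effect stands in for IHDP, and only the training-set ATE error is used to rank candidates, which is a simplification of the tables above.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(1)
n = 200
x = rng.randn(n, 5)
t = rng.randint(0, 2, size=n)
y = np.sin(x[:, 0]) + 1.5 * t + 0.1 * rng.randn(n)  # true effect 1.5 by construction
xt = np.column_stack([x, t])

best = None
for gamma in [0.1, 0.05, 0.01, 0.001]:  # gamma grid as in the tables
    model = SVR(kernel="rbf", C=1e3, gamma=gamma).fit(xt, y)
    ate_pred = float((model.predict(np.column_stack([x, np.ones(n)]))
                      - model.predict(np.column_stack([x, np.zeros(n)]))).mean())
    err = abs(ate_pred - 1.5)           # error against the known simulated effect
    if best is None or err < best[1]:
        best = (gamma, err)
print(best[0])                          # gamma with the lowest ATE error here
```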

Finally, the final results obtained by this thesis and its experiments are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results obtained in (Shalit, Johansson and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017). Within-sample, IHDP 1000 replications

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .73 ± .04
OLS/LR-2    2.4 ± .1    .14 ± .01
BLR         5.8 ± .3    .72 ± .04
k-NN        2.1 ± .1    .14 ± .01
TMLE        5.0 ± .2    .30 ± .01
BART        2.1 ± .1    .23 ± .01
RANDFOR     4.2 ± .2    .73 ± .05
CAUSFOR     3.8 ± .2    .18 ± .01
BNN         2.2 ± .1    .37 ± .03
TARNET      .88 ± .0    .26 ± .01
CFR MMD     .73 ± .0    .30 ± .01
CFR WASS    .71 ± .0    .25 ± .01

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017). Out-of-sample, IHDP 1000 replications

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .94 ± .06
OLS/LR-2    2.5 ± .1    .31 ± .02
BLR         5.8 ± .3    .93 ± .05
k-NN        4.1 ± .2    .79 ± .05
BART        2.3 ± .1    .34 ± .02
RANDFOR     6.6 ± .3    .96 ± .06
CAUSFOR     3.8 ± .2    .40 ± .03
BNN         2.1 ± .1    .42 ± .03
TARNET      .95 ± .0    .28 ± .01
CFR MMD     .78 ± .0    .31 ± .01
CFR WASS    .76 ± .0    .27 ± .01


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section covered publications implementing powerful feature selection methods, an additional experiment was performed in the developed code, applying Recursive Feature Elimination (RFE) using the scikit-learn library.

Strictly speaking, under the Strong Ignorability assumption made for the studied dataset, it would not be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse across all algorithms for the causal inference metrics detailed in this work, so they are not shown; the reader can revisit them in the code implementation for further analysis.
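A minimal RFE sketch with scikit-learn is shown below, including an artificially duplicated (highly correlated) covariate of the kind mentioned above; the data and feature counts are toy assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = rng.randn(300, 10)
x[:, 5] = x[:, 0] + 0.01 * rng.randn(300)   # near-duplicate, highly correlated covariate
y = 2 * x[:, 0] - x[:, 1] + 0.1 * rng.randn(300)

# RFE drops covariates one at a time, refitting the regressor at each step
selector = RFE(LinearRegression(), n_features_to_select=3).fit(x, y)
print(selector.support_)   # boolean mask of the three surviving covariates
```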

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for the 10 replications; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel dual-core i7.

Domain adaptation algorithms are a promising field for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks. Within-sample and out-of-sample, IHDP 10 replications

                        εITE          εATE          √εPEHE
DANN (Within-sample)    1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

4.3 Discussion

As can clearly be noticed, the results of the machine learning regressors applied in this dissertation are very close to those obtained in the work published by the cited, compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treatment assignment, nor any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.


The compared authors appear to expend considerable effort on complicated methods without gaining much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or of the treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No metrics on heavily unbalanced treatment assignment datasets are reported, though.

Finally, the training and testing errors of the Domain Adaptation Neural Network on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less (or no) added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate, custom techniques that constitute the state-of-the-art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except scaling the features, which is part of the must-do tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, although it evolved once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the deep learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, continuous-space causal problems on observational data, with more than two possible treatments to apply, are substantially more suitable for Reinforcement Learning algorithms than for any deep neural network or regressor.

Finally, this work is intended to fill a considerable gap: straightforward definitions for applying machine learning to causality. Although several noticeable papers have been published in the last two years, they are difficult to follow when relating terms from the causal inference field to researchers with computer and data science backgrounds. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions for this work include the following approaches.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to multi-valued treatments. To the best of my knowledge, it should not be costly to make this modification in the code; however, at least one new dataset supporting this kind of treatment space would need to be processed.

Third, applying this method to binary factual predictions with a policy risk threshold deciding whether a treatment should be applied is an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on such datasets, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it achieved in the experiment run here.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied over time frames them as time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers in machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). "Deep Counterfactual Networks with Propensity-Dropout". Tech. rep. arXiv: 1706.05966. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. "Causal Reasoning from Longitudinal Data". DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). "Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks". Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur et al. (2016). "Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features". Tech. rep. arXiv: 1612.08082.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–7360. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113.

Athey, Susan and Guido W. Imbens. "Recursive Partitioning for Heterogeneous Causal Effects". Tech. rep. arXiv: 1504.01132. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786.

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x.

Ben-David, Shai et al. (2007). "Analysis of Representations for Domain Adaptation". URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06). Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094.

Bottou, Léon et al. (2013). "Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising". Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf

Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). "Importance Sampling for Fair Policy Selection". Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O. et al. "Nonlinear causal discovery with additive noise models". Tech. rep.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803.

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www

Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

– (2016a). "Learning Representations for Counterfactual Inference". Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

– (2016b). "Learning Representations for Counterfactual Inference". Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247.

Mooij, Joris M. et al. (2016). "Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks". pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748.

Paduraru, Cosmin et al. (2012). "An Empirical Analysis of Off-policy Learning in Discrete MDPs". pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6.

Rosenbaum, Paul R. (2002). "Observational Studies". pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1.

Rosenbaum, Paul R. and Donald B. Rubin (1983). "The Central Role of the Propensity Score in Observational Studies for Causal Effects". Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. "Causal Inference Using Potential Outcomes: Design, Modeling, Decisions". DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. "Reliable Decision Support using Counterfactual Models". Tech. rep. arXiv: 1703.10651. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). "Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms". Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (complete draft). URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). "Counterfactual Risk Minimization: Learning from Logged Bandit Feedback". URL: http://proceedings.mlr.press/v37/swaminathan15.html

– (2015b). "The Self-Normalized Estimator for Counterfactual Learning".

Tekin, Cem and Mihaela van der Schaar (2018). "Episodic Multi-armed Bandits". Tech. rep. arXiv: 1508.00641. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). "Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets". pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342

– (2017). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". Tech. rep. arXiv: 1510.04342v4.

Zhang, Kun et al. (2013). "Domain Adaptation under Target and Conditional Shift". URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. "Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks". Tech. rep.


Chapter 3 Methodology

Lastly, it is important to note that there are two different kinds of predictions in causal inference. The most common one is the setting in which the counterfactual outcomes could not be recorded because of the nature of the experiment (this being the fundamental problem of causal analysis). In these cases a policy risk function π is designed to decide whether or not to apply the treatment t depending on a certain threshold θ. When iterating over different values of this threshold for the trained dataset, the main goal is to make as few errors as possible when deciding between the active treatment and the control.
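To make the policy-risk idea concrete, the following minimal sketch thresholds a predicted individual effect at θ and scores the resulting policy on the factual outcomes. The function names and toy numbers are illustrative assumptions, not code from this thesis.

```python
import numpy as np

def policy(ite_pred, theta=0.0):
    """Treat a unit only when its predicted individual effect exceeds theta."""
    return (ite_pred > theta).astype(int)

def policy_risk(pi, t, y_factual):
    """Score the policy on units whose observed treatment matches its decision.
    Assumes larger y_factual is better, so risk = 1 - mean outcome."""
    match = (pi == t)
    if not match.any():
        return np.nan
    return 1.0 - y_factual[match].mean()

# toy example: 4 units with predicted effects and observed (factual) data
ite_pred = np.array([0.9, -0.2, 0.4, -0.7])
t = np.array([1, 0, 0, 1])          # treatments actually applied
y = np.array([0.8, 0.6, 0.5, 0.1])  # factual outcomes
pi = policy(ite_pred, theta=0.0)    # -> [1, 0, 1, 0]
risk = policy_risk(pi, t, y)        # 1 - mean(y of units 0 and 1) = 0.3
```

Sweeping `theta` over a grid and keeping the value with the lowest risk is the iteration described above.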

3.2 IHDP dataset

The Infant Health and Development Program (IHDP) (Gross, 1993) was a Randomized Controlled Trial (RCT) held in the United States across multiple sites, applying control and treatment to reduce the developmental and health problems of low-birth-weight premature infants. On the one hand, the treated group received home visits and integration at a dedicated child development center, in addition to a pediatric follow-up, which can be described as high-quality child care. On the other hand, the control group only received the pediatric follow-up.

However, (Hill, 2011) presented a semi-synthetic (also referred to, in this work and in the field, as semi-simulated) dataset derived directly from the original IHDP RCT (Gross, 1993) mentioned in the above paragraph. In (Hill, 2011) some continuous and binary covariates from this real-life RCT were selected; making use of these covariates, the author created simulated outcomes and generated non-parametric simulated outcomes for the whole population of the trial. In total, 25 covariates of the whole study were taken for this dataset creation. The author then introduced an artificial imbalance between control and treated individuals by removing a subset of the treated population. Finally, the dataset comprises 747 subjects (units, or inputs), of which 608 did not receive the treatment (control) and 139 were treated. As can be clearly noticed, the dataset ends up being quite unbalanced, especially for learning and predicting the effects of the treatments t = 0 and t = 1 based on the generalization an algorithm can perform.

Along with the covariates, each unit carries the simulated causal information: the treatment effectively applied (t = 0 or t = 1), the observed factual outcome (yF), the counterfactual outcome (yCF), and the average outcomes mu0 and mu1.
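The structure just described can be illustrated with a synthetic stand-in. The arrays below are randomly generated placeholders, not the real IHDP files; only the shapes and field names mirror the description above (747 units, 25 covariates, 139 treated).

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_cov = 747, 25                 # 608 control + 139 treated, 25 covariates

x = rng.normal(size=(n_units, n_cov))    # covariates
t = np.zeros(n_units, dtype=int)
t[:139] = 1                              # artificially imbalanced treatment indicator
mu0 = x[:, 0]                            # stand-in average outcome under control
mu1 = mu0 + 4.0                          # stand-in average outcome under treatment
y0 = mu0 + rng.normal(scale=0.1, size=n_units)
y1 = mu1 + rng.normal(scale=0.1, size=n_units)
yf = np.where(t == 1, y1, y0)            # factual outcome (the only one observed)
ycf = np.where(t == 1, y0, y1)           # counterfactual outcome (held out)
```

Only `x`, `t` and `yf` would be visible to a learner; `ycf`, `mu0` and `mu1` exist solely because the dataset is semi-synthetic.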

In this dissertation, 100 and 1000 replications of the original (Hill, 2011) dataset were used for evaluation and hyperparameter selection, all with the log-linear response surface implemented as setting "B" in the NPCI package (Dorie, 2016). The 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the exact same files used in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), which are the state-of-the-art baselines chosen for comparison in the experiments of the present work.

This dataset is nowadays a strong benchmark framework for analysing the prediction results of a new machine learning technique applied to causal inference on observational data.


3.3 Other articles' metrics

Other published articles evaluating ITE, ATE and PEHE errors are worth mentioning for the reader who wishes to look further. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b) the IHDP dataset (Hill, 2011) is run on 100 replications for hyperparameter tuning and 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. These results are not shown in this dissertation since the response surface chosen differs from the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). For the BART results, (Johansson, Shalit, and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication from the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran experiments on the IHDP dataset (Hill, 2011). However, the authors do not state the number of replications used to gather the metrics, nor whether a log-linear "A", "B" or any other response surface was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with this dissertation's results and are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset were performed to ultimately predict all the factual (yF) and counterfactual (yCF) outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as should be clear by this point, neither the counterfactual outcome yCF nor the average outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these three values, along with the factual outcome yF, are only used to compute the εITE, εATE and εPEHE errors.
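As a sketch, the three error metrics can be computed along the following lines. This is one common formulation; the exact code of (Louizos et al., 2017) may differ in details such as averaging conventions.

```python
import numpy as np

def eval_errors(ite_pred, yf, ycf, t, mu0, mu1):
    """Compute eps_ITE, eps_ATE and sqrt(eps_PEHE).
    yf, ycf, mu0 and mu1 are used only for evaluation, never for training."""
    ite_noisy = np.where(t == 1, yf - ycf, ycf - yf)   # noisy individual effect
    ite_true = mu1 - mu0                               # noiseless individual effect
    eps_ite = np.sqrt(np.mean((ite_noisy - ite_pred) ** 2))
    eps_ate = np.abs(np.mean(ite_true) - np.mean(ite_pred))
    eps_pehe = np.sqrt(np.mean((ite_true - ite_pred) ** 2))
    return eps_ite, eps_ate, eps_pehe

# perfectly predicting a constant effect of 2 yields zero on all three metrics
errs = eval_errors(np.array([2.0, 2.0]),
                   yf=np.array([3.0, 1.0]), ycf=np.array([1.0, 3.0]),
                   t=np.array([1, 0]),
                   mu0=np.array([1.0, 1.0]), mu1=np.array([3.0, 3.0]))
```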

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The 10, 100 and 1000 replications were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify what within-sample and out-of-sample stand for. The definitions are those given by (Shalit, Johansson, and Sontag, 2017), the same convention later followed by (Louizos et al., 2017) to perform, compare and show their results.

Within-sample: this refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already-trained model on the training and validation (if any) dataset. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples with treatments t = 0 and t = 1 applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population who received treatment t = 1 and the population who received t = 0 might come from completely different probability distributions. All of these are common problems of observational data and were mentioned earlier in this work.


Out-of-sample: these predictions are made on completely unseen units, outside the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions different from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, in order to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0, and subsequently predictions are made setting all treatment values to t = 1. The difference between these two predictions for each input is known as the ITE and ultimately defines whether the patient would benefit from the treatment. Mathematically it is represented by E[Y1 − Y0 | x].

The machine learning algorithms implemented in Python by (User guide: contents - scikit-learn 0.19.2 documentation) were run with their default hyperparameters to obtain the above-mentioned metrics, which are shown in the tables displayed in this chapter. Hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
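The prediction procedure described above can be sketched as follows, using a stand-in linear model on synthetic data. Appending the treatment as an extra input feature is one simple way to let a single regressor predict both arms; the data-generating numbers below are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))                 # covariates
t = rng.integers(0, 2, size=200)              # observed treatments
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.01, size=200)   # true effect = 2

# train on [covariates, treatment]; then predict both counterfactual arms
model = LinearRegression().fit(np.column_stack([X, t]), y)
y1_hat = model.predict(np.column_stack([X, np.ones(len(X))]))
y0_hat = model.predict(np.column_stack([X, np.zeros(len(X))]))
ite_hat = y1_hat - y0_hat                     # estimate of E[Y1 - Y0 | x]
ate_hat = ite_hat.mean()                      # close to the true effect of 2
```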

4.1 Machine learning methods applied to the IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (lower is better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14   0.94 ± 0.35   2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99   0.97 ± 0.67   4.80 ± 2.80
LassoLars                         4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
Lasso                             4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
ARDRegression                     3.92 ± 2.01   0.97 ± 0.74   4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09   1.54 ± 1.07   4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99   0.89 ± 0.63   4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
LinearRegression                  3.92 ± 2.01   0.89 ± 0.65   4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is worth remarking that the split between training and testing was performed randomly over the training dataset only. The intention was to test whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63   1.24 ± 0.62   2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67   1.82 ± 1.35   4.13 ± 2.23
LassoLars                         4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
Lasso                             4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
ARDRegression                     3.57 ± 1.70   1.83 ± 1.41   4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94   2.38 ± 1.75   4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68   1.76 ± 1.30   4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
LinearRegression                  3.58 ± 1.70   1.77 ± 1.33   4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test as downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60   0.86 ± 0.16   5.65 ± 0.83
LassoLars                         4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
Lasso                             4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
ARDRegression                     4.49 ± 0.60   0.81 ± 0.16   5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75   0.83 ± 0.14   5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59   0.79 ± 0.15   5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
LinearRegression                  4.53 ± 0.60   0.79 ± 0.16   5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54   1.02 ± 0.26   5.37 ± 0.78
LassoLars                         4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
Lasso                             4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
ARDRegression                     4.27 ± 0.54   1.00 ± 0.26   5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69   1.00 ± 0.21   5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53   0.99 ± 0.25   5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
LinearRegression                  4.31 ± 0.55   0.99 ± 0.26   5.36 ± 0.79


Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, hyperparameter tuning, where performed, used this number of replications.

TABLE 4.5: IHDP 100 replications, already-split dataset - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38   0.76 ± 0.08   3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58   0.80 ± 0.14   5.61 ± 0.83
LassoLars                         4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
Lasso                             4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
ARDRegression                     4.45 ± 0.59   0.77 ± 0.15   5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62   0.83 ± 0.13   5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57   0.72 ± 0.13   5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
LinearRegression                  4.48 ± 0.59   0.75 ± 0.14   5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already-split dataset - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28   0.73 ± 0.07   3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57   0.81 ± 0.11   5.74 ± 0.89
LassoLars                         4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
Lasso                             4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
ARDRegression                     4.42 ± 0.58   0.78 ± 0.11   5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59   1.01 ± 0.17   5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56   0.85 ± 0.13   5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
LinearRegression                  4.45 ± 0.58   0.78 ± 0.11   5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting "B" generated using the code from (Dorie, 2016), was used for both types of measures.

Four tables are shown for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not dramatically, but enough to keep the scaling in the final results of the methods presented in the following section.
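The scaling step can be sketched as follows with scikit-learn's MinMaxScaler; the toy data is illustrative, and in practice the scaler would be fit on the training covariates only and reused on the test covariates.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVR

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

scaler = MinMaxScaler()
Xs = scaler.fit_transform(X)     # every column now spans exactly [0, 1]

# a pipeline applies the same fitted scaling before the regressor at predict time
pipe = make_pipeline(MinMaxScaler(), SVR(kernel="rbf", C=1e3, gamma=0.01)).fit(X, y)
```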



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12   0.69 ± 0.02   3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19   0.78 ± 0.04   5.81 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22   0.90 ± 0.05   5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18   0.70 ± 0.03   5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09   0.78 ± 0.03   3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19   0.98 ± 0.05   5.79 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22   1.13 ± 0.07   5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08   0.33 ± 0.02   2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19   0.73 ± 0.04   5.80 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22   1.02 ± 0.06   5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.69 ± 0.03   5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20   1.64 ± 0.09   6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18   0.67 ± 0.03   5.79 ± 0.25
ElasticNet                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is well below that of the regressors, the main reason being that the probabilities assigned to the encoded target values are not the values that actually need to be predicted, and further precision is lost when decoding the predictions. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

                                      εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)   7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)       7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                      εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)   5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)       5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore a few runs of hyper-parameter tuning were done. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they used

41 Machine learning methods applied to IHDP dataset 25

for their own hyper-parameter selection In Table 413 and Table 414 the results ofrunning SVR hyperparameter selection with the final results shown
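As a sketch of how the selected configuration can be applied, the snippet below fits an SVR with the tuned hyper-parameters on synthetic stand-in data. The design choice of appending the treatment indicator as an extra input column is an assumption for illustration (the thesis does not spell out its exact design matrix), and all variable names are mine:

```python
import numpy as np
from sklearn.svm import SVR

# Toy stand-in for one IHDP replication: 25 covariates, a binary
# treatment t, and a factual outcome y (shapes/values are illustrative).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 25))
t = rng.binomial(1, 0.5, size=200)
y = X[:, 0] + t * (1.0 + X[:, 1]) + 0.1 * rng.normal(size=200)

# Assumed design: treatment appended as one more feature; the model is
# fit on factual outcomes only (the counterfactual is never observed).
Xt = np.column_stack([X, t])
model = SVR(kernel="rbf", C=1e3, gamma=0.01).fit(Xt, y)

# Both potential outcomes are then predicted by toggling t for every unit.
y0_hat = model.predict(np.column_stack([X, np.zeros(len(X))]))
y1_hat = model.predict(np.column_stack([X, np.ones(len(X))]))
ite_hat = y1_hat - y0_hat   # estimated individual treatment effects
ate_hat = ite_hat.mean()    # estimated average treatment effect
```

Grid-searching gamma then amounts to repeating this fit over candidate values and keeping the one with the lowest error on the tuning replications.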

TABLE 4.13 IHDP 100 replications, SVR hyper-parameter tuning - Within sample

                        εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1        3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g0.05       2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g0.01       2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0.001      3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g0.0001     4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g0.00001    4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g0.1       3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g0.1       3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g0.1       3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2    2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1    2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4    2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2   2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14 IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

                        εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1        2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g0.05       2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g0.01       2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0.001      3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g0.0001     4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g0.00001    4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g0.1       2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g0.1       2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g0.1       2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2    2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1    2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4    2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2   3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39

Finally, the results obtained by the experiments run in this thesis are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results reported in (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .73 ± .04
OLS/LR-2    2.4 ± .1    .14 ± .01
BLR         5.8 ± .3    .72 ± .04
k-NN        2.1 ± .1    .14 ± .01
TMLE        5.0 ± .2    .30 ± .01
BART        2.1 ± .1    .23 ± .01
RAND.FOR.   4.2 ± .2    .73 ± .05
CAUS.FOR.   3.8 ± .2    .18 ± .01
BNN         2.2 ± .1    .37 ± .03
TARNET      .88 ± .0    .26 ± .01
CFR MMD     .73 ± .0    .30 ± .01
CFR WASS    .71 ± .0    .25 ± .01

Within sample IHDP 1000 replications

TABLE 4.16 ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .94 ± .06
OLS/LR-2    2.5 ± .1    .31 ± .02
BLR         5.8 ± .3    .93 ± .05
k-NN        4.1 ± .2    .79 ± .05
BART        2.3 ± .1    .34 ± .02
RAND.FOR.   6.6 ± .3    .96 ± .06
CAUS.FOR.   3.8 ± .2    .40 ± .03
BNN         2.1 ± .1    .42 ± .03
TARNET      .95 ± .0    .28 ± .01
CFR MMD     .78 ± .0    .31 ± .01
CFR WASS    .76 ± .0    .27 ± .01

Out-of-sample IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section presented publications implementing powerful feature selection methods, an experiment was performed in the developed code that applies Recursive Feature Elimination (RFE) using the scikit-learn library.

Strictly speaking, under the Strong Ignorability assumption made for the studied dataset it would not be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, removing some of them might relieve part of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work, so they are not shown here, but they can be reproduced from the code implementation for further analysis.
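A minimal sketch of this kind of RFE wiring is shown below. The estimator, feature counts, and synthetic data are illustrative assumptions, not the thesis's exact configuration:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data with one highly correlated duplicate feature --
# the situation feature elimination is meant to mitigate.
rng = np.random.RandomState(1)
X = rng.normal(size=(100, 10))
X[:, 5] = X[:, 0] + 0.01 * rng.normal(size=100)   # near-copy of column 0
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=100)

# Recursively drop the weakest feature until 5 remain, ranking
# features by the wrapped estimator's coefficients at each step.
selector = RFE(estimator=LinearRegression(), n_features_to_select=5)
selector.fit(X, y)
kept = np.flatnonzero(selector.support_)   # indices of surviving features
```

The reduced matrix `selector.transform(X)` can then be fed to any of the regressors evaluated earlier.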

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network (DANN) implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the 10-replication results reported in this work; the code uploaded with this thesis contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel Dual Core i7.

Domain Adaptation algorithms are, due to their architectural design, a promising direction for ITE and ATE prediction.

TABLE 4.17 Domain Adaptation Neural Networks

                       εITE          εATE          √εPEHE
DANN (Within-sample)   1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)   1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and Out-of-sample IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the machine learning regressor algorithms applied in this dissertation obtain results very close to those published in the work of the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the treatment imbalance of the dataset, nor any other custom loss function was needed to obtain the results shown in Table 4.10 and Table 4.9.


It seems that an excessive amount of effort by the authors leads to complicated methods that do not gain much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the 10 replications of the Domain Adaptation Neural Network training and testing showed promising errors that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate, custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is a must-do task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, though its emphasis shifted when the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable for this problem than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal problems on observational data, with more than two possible outcomes, that are substantially more suitable to solve with Reinforcement Learning algorithms than with any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers have been published in the last two years, they are difficult to follow for researchers from a computer and data science background trying to relate terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future work on this topic should take the following directions.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome should be extended to multi-valued treatments. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold deciding whether a treatment should be applied, would be an important next step in causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replications dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it achieved in the experiment run here.

Lastly, these methods could be applied to causal datasets whose outcomes vary with time and with the treatments applied, framed as time-series problems in continuous space. Such datasets will possibly be the next focus of researchers in machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. "Causal Reasoning from Longitudinal Data". DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur, et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai, et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094

Bottou, Léon, et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor, et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang, et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O., et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571

Louizos, Christos, et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2

Maathuis, Marloes H., et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.ncbi.nlm.nih.gov/pubmed/20354511

Mooij, Joris M., et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991

Nahum-Shani, Inbal, et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373. URL: http://www.ncbi.nlm.nih.gov/pubmed/23025434

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref

Paduraru, Cosmin, et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (complete draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu, et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018)

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352

Zhang, Kun, et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.



3.3 Other articles metrics

It is worth mentioning other published articles that evaluate ITE, ATE and PEHE errors, so the reader can look further into them if interested. The results of the papers mentioned in this section were collected using the same initial dataset from (Hill, 2011), but with slightly different replication methods, different numbers of runs, or without specifying how many replications were used.

In (Johansson, Shalit, and Sontag, 2016b) the IHDP dataset (Hill, 2011) was run on 100 replication experiments for hyperparameter tuning and on 1000 replications for evaluation. All these replications were created using the NPCI package (Dorie, 2016), selecting the log-linear response surface implemented as setting "B" in the mentioned tool. Those results are not shown in this dissertation, since the chosen response surface differs from the one used in the state-of-the-art results and papers published in the following years (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). For their BART results, (Johansson, Shalit, and Sontag, 2016b) relied on Bayesian Additive Regression Trees (Chipman, George, and McCulloch, 2010), applying a non-linear regression model following the implementation given in the BayesTree R package.

A recent publication in the Proceedings of the 10th International Conference on Educational Data Mining (Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks) also ran its experiments on the IHDP dataset (Hill, 2011). However, the authors make explicit neither the number of replications used to gather the metrics nor whether the log-linear "A" or "B" setting, or any other method, was used to simulate the semi-synthetic dataset. Consequently, their results cannot be compared with the results of this dissertation and are not shown in this work.


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset were performed, ultimately predicting the factual yF and counterfactual yCF outcomes for every single unit. Subsequently, those values are fed into the programming code produced by (Louizos et al., 2017), in which the εITE, εATE and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which accounts for identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as the reader should have clear by this point, neither the counterfactual outcome yCF nor the true outcome means mu0, mu1 can be used at all to train the regressor models. Instead, these three values, along with the factual outcome yF, are only used to compute the εITE, εATE and εPEHE errors.

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The replications were downloaded from (Johansson, 2017, accessed July 19, 2018) and are the same ones used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose tables with their state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify here what within-sample and out-of-sample stand for. The definitions are the ones given by (Shalit, Johansson, and Sontag, 2017), the same convention later followed by (Louizos et al., 2017) to perform, compare and show their results.

Within-sample: this refers to all the errors (ITE, ATE and PEHE) made by the predictions of the already-trained model on the training and validation (if any) dataset. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples to which treatments t = 0 and t = 1 were applied and observed), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population who received treatment t = 1 and the population who received t = 0 might come from completely different probability distributions. All of these are common problems of observational data, and they were mentioned in earlier chapters.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions that differ even from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for every single unit (input) of the testing dataset, to later determine the ITE, ATE and PEHE errors.

Once the model is trained, it predicts for each one of the inputs (units) using the treatment value t = 0, and then the predictions are repeated setting all the values of the treatment to t = 1. The subtraction of these two predictions for each input is known as the ITE, and it ultimately defines whether the patient would benefit or not from applying the treatment. Mathematically it is represented by E[Y1 − Y0 | x]. The machine learning algorithms, implemented in Python using scikit-learn (User guide: contents — scikit-learn 0.19.2 documentation), were run with default hyperparameters to obtain the above-mentioned metrics shown in the tables displayed in this chapter. The hyperparameter tuning was done with 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
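As a sketch of the error computations described above (function and variable names are mine, not from the reference code; εITE follows its own definition in the implementation of Louizos et al. and is omitted here), with mu0/mu1 denoting the true outcome means shipped with IHDP:

```python
import numpy as np

def evaluate_ite(y0_hat, y1_hat, mu0, mu1):
    """ATE and PEHE errors of the estimated effect y1_hat - y0_hat.

    eps_ATE  = |mean(estimated ITE) - mean(true ITE)|
    eps_PEHE = sqrt(mean((estimated ITE - true ITE)^2))
    """
    ite_hat = y1_hat - y0_hat     # per-unit effect from toggling t
    ite_true = mu1 - mu0          # noiseless true effect per unit
    eps_ate = np.abs(ite_hat.mean() - ite_true.mean())
    eps_pehe = np.sqrt(np.mean((ite_hat - ite_true) ** 2))
    return eps_ate, eps_pehe

# Tiny worked example: two units with true effects 1 and 3,
# both estimated as 1.
eps_ate, eps_pehe = evaluate_ite(
    y0_hat=np.array([0.0, 0.0]), y1_hat=np.array([1.0, 1.0]),
    mu0=np.array([0.0, 0.0]), mu1=np.array([1.0, 3.0]))
# eps_ate = |1 - 2| = 1, eps_pehe = sqrt((0 + 4) / 2) = sqrt(2)
```

The same function serves within-sample and out-of-sample evaluation; only the units passed in change.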

41 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replications IHDP dataset.

Their medians and variances across the 10 replications for a within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1 IHDP 10 replications with traditional machine learning algorithms - Within sample

                                 εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.62 ± 1.14   0.94 ± 0.35   2.73 ± 1.23
BayesianRidge                    3.90 ± 1.99   0.97 ± 0.67   4.80 ± 2.80
LassoLars                        4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
Lasso                            4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
ARDRegression                    3.92 ± 2.01   0.97 ± 0.74   4.80 ± 2.81
PassiveAggressiveRegressor       4.39 ± 2.09   1.54 ± 1.07   4.97 ± 2.92
TheilSenRegressor                3.93 ± 1.99   0.89 ± 0.63   4.78 ± 2.79
BaggingRegressor                 5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
KNeighboursRegressor             5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
LinearRegression                 3.92 ± 2.01   0.89 ± 0.65   4.79 ± 2.79

In the next experiment, shown in Tables 4.3 and 4.4, 100 replications of the IHDP dataset were taken into account, within-sample and out-of-sample respectively. In this case it is notable that the split between training and testing was performed randomly, only over the training dataset. The intention was to check whether the results

41 Machine learning methods applied to IHDP dataset 21

TABLE 4.2 IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                 εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.26 ± 0.63   1.24 ± 0.62   2.34 ± 0.98
BayesianRidge                    3.54 ± 1.67   1.82 ± 1.35   4.13 ± 2.23
LassoLars                        4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
Lasso                            4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
ARDRegression                    3.57 ± 1.70   1.83 ± 1.41   4.14 ± 2.27
PassiveAggressiveRegressor       4.19 ± 1.94   2.38 ± 1.75   4.45 ± 2.49
TheilSenRegressor                3.62 ± 1.68   1.76 ± 1.30   4.08 ± 2.21
BaggingRegressor                 4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
KNeighboursRegressor             4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
LinearRegression                 3.58 ± 1.70   1.77 ± 1.33   4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017, accessed July 19, 2018), are similar or differ dramatically.

TABLE 4.3 IHDP 100 replications - Within sample

                                 εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
BayesianRidge                    4.49 ± 0.60   0.86 ± 0.16   5.65 ± 0.83
LassoLars                        4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
Lasso                            4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
ARDRegression                    4.49 ± 0.60   0.81 ± 0.16   5.64 ± 0.83
PassiveAggressiveRegressor       5.49 ± 0.75   0.83 ± 0.14   5.66 ± 0.83
TheilSenRegressor                4.45 ± 0.59   0.79 ± 0.15   5.63 ± 0.83
BaggingRegressor                 5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
KNeighboursRegressor             5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
LinearRegression                 4.53 ± 0.60   0.79 ± 0.16   5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54   1.02 ± 0.26   5.37 ± 0.78
LassoLars                         4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
Lasso                             4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
ARDRegression                     4.27 ± 0.54   1.00 ± 0.26   5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69   1.00 ± 0.21   5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53   0.99 ± 0.25   5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
KNeighborsRegressor               4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
LinearRegression                  4.31 ± 0.55   0.99 ± 0.26   5.36 ± 0.79

22 Chapter 4 Experiments

Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, hyperparameter tuning, if any, was performed on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38   0.76 ± 0.08   3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58   0.80 ± 0.14   5.61 ± 0.83
LassoLars                         4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
Lasso                             4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
ARDRegression                     4.45 ± 0.59   0.77 ± 0.15   5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62   0.83 ± 0.13   5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57   0.72 ± 0.13   5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
KNeighborsRegressor               5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
LinearRegression                  4.48 ± 0.59   0.75 ± 0.14   5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28   0.73 ± 0.07   3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57   0.81 ± 0.11   5.74 ± 0.89
LassoLars                         4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
Lasso                             4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
ARDRegression                     4.42 ± 0.58   0.78 ± 0.11   5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59   1.01 ± 0.17   5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56   0.85 ± 0.13   5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
KNeighborsRegressor               4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
LinearRegression                  4.45 ± 0.58   0.78 ± 0.11   5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting "A" generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four different tables are shown for the 1000 replications. Two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The improvement was not dramatic, but it was enough to keep the scaling for the final results of the methods presented in the following section.
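As a sketch, the scaling step might look as follows. The data here is synthetic and the variable names are illustrative assumptions; only the MinMaxScaler and SVR APIs are taken from scikit-learn as-is.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

# Hypothetical stand-ins for the IHDP covariate matrix and factual outcomes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 25))
X_test = rng.normal(size=(20, 25))
y_train = rng.normal(size=100)

# Fit the scaler on the training covariates only, then apply the same
# transformation to the test covariates to avoid information leakage.
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Train any regressor on the scaled features; an SVR is used for illustration.
model = SVR(kernel="rbf", C=1e3, gamma=0.01).fit(X_train_s, y_train)
y_pred = model.predict(X_test_s)
```

Note that test-set values can fall slightly outside [0, 1], since the scaler's minima and maxima come from the training data only.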



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12   0.69 ± 0.02   3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19   0.78 ± 0.04   5.81 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22   0.90 ± 0.05   5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18   0.70 ± 0.03   5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
KNeighborsRegressor               5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09   0.78 ± 0.03   3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19   0.98 ± 0.05   5.79 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22   1.13 ± 0.07   5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
KNeighborsRegressor               4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08   0.33 ± 0.02   2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19   0.73 ± 0.04   5.80 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22   1.02 ± 0.06   5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.69 ± 0.03   5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
KNeighborsRegressor               4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20   1.64 ± 0.09   6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18   0.67 ± 0.03   5.79 ± 0.25
ElasticNet                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

Method                            εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighborsRegressor               4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class setting was applied. Its performance is well below that of the regressors, mainly because the continuous target values must be encoded into discrete classes before probabilities can be assigned to them, and these classes are not the actual values that need to be predicted; precision is also lost when decoding the predictions back. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
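A hedged sketch of this encode-fit-decode procedure follows. The quantile binning and bin-centre decoding are my assumptions about how the encoding could be done, not the thesis's exact implementation; with the newton-cg solver, recent scikit-learn versions fit a multinomial loss by default.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 25))
y = X[:, 0] + rng.normal(scale=0.5, size=300)  # continuous outcome

# Encode: bin the continuous outcome into 10 discrete classes by quantiles.
bins = np.quantile(y, np.linspace(0, 1, 11)[1:-1])
y_cls = np.digitize(y, bins)

# Fit a multinomial classifier with l2 penalty (lbfgs works the same way).
clf = LogisticRegression(penalty="l2", solver="newton-cg", max_iter=1000)
clf.fit(X, y_cls)

# Decode: map predicted classes back to the mean outcome of each bin,
# which is where the precision loss described above comes from.
centres = np.array([y[y_cls == k].mean() for k in range(10)])
y_decoded = centres[clf.predict(X)]
```

Even a perfect classifier can only recover the bin centres, so the decoded predictions are quantized versions of the true outcomes.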

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

Method                                  εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)     7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)         7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                                  εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)     5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)         5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were performed. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel with C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method that the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyper-parameter selection.
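A minimal version of such a sweep might look as follows. The data is synthetic, and held-out mean squared error is used as the selection criterion purely for illustration, whereas the thesis selected on the causal metrics.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X, y = rng.normal(size=(200, 25)), rng.normal(size=200)

# Candidate settings mirroring the tables: rbf kernel, C fixed at 1e3,
# gamma swept over decades.
gammas = [0.1, 0.05, 0.01, 0.001, 0.0001]
best_gamma, best_err = None, np.inf
for gamma in gammas:
    model = SVR(kernel="rbf", C=1e3, gamma=gamma).fit(X[:150], y[:150])
    err = np.mean((model.predict(X[150:]) - y[150:]) ** 2)  # held-out MSE
    if err < best_err:
        best_gamma, best_err = gamma, err
```

With C already large, the tables also suggest that increasing it further (1e10 to 1e30) leaves the errors unchanged, so gamma is the parameter that matters in this regime.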

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

Method                      εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1            3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g0.05           2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g0.01           2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0.001          3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g0.0001         4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g0.00001        4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g0.1           3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g0.1           3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g0.1           3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2        2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1        2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4        2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2       2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

Method                      εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1            2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g0.05           2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g0.01           2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0.001          3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g0.0001         4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g0.00001        4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g0.1           2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g0.1           2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g0.1           2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2        2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1        2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4        2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2       3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39

Finally, the final results obtained by this thesis and the experiments run here are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results reported in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method       √εPEHE      εATE
OLS/LR-1     5.8 ± .3    .73 ± .04
OLS/LR-2     2.4 ± .1    .14 ± .01
BLR          5.8 ± .3    .72 ± .04
k-NN         2.1 ± .1    .14 ± .01
TMLE         5.0 ± .2    .30 ± .01
BART         2.1 ± .1    .23 ± .01
RANDFOR      4.2 ± .2    .73 ± .05
CAUSFOR      3.8 ± .2    .18 ± .01
BNN          2.2 ± .1    .37 ± .03
TARNET       .88 ± .0    .26 ± .01
CFR MMD      .73 ± .0    .30 ± .01
CFR WASS     .71 ± .0    .25 ± .01

Within-sample, IHDP 1000 replications.

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method       √εPEHE      εATE
OLS/LR-1     5.8 ± .3    .94 ± .06
OLS/LR-2     2.5 ± .1    .31 ± .02
BLR          5.8 ± .3    .93 ± .05
k-NN         4.1 ± .2    .79 ± .05
BART         2.3 ± .1    .34 ± .02
RANDFOR      6.6 ± .3    .96 ± .06
CAUSFOR      3.8 ± .2    .40 ± .03
BNN          2.1 ± .1    .42 ± .03
TARNET       .95 ± .0    .28 ± .01
CFR MMD      .78 ± .0    .31 ± .01
CFR WASS     .76 ± .0    .27 ± .01

Out-of-sample, IHDP 1000 replications.


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications implementing powerful feature selection methods were presented in the Related Work section, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn library.

In addition, assuming Strong Ignorability on the studied dataset, it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; they are therefore not shown here, but they can be reviewed by the reader in the code implementation for further analysis.
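For reference, the RFE experiment can be sketched as below. The base estimator, the number of retained features, and the synthetic data are illustrative choices, not necessarily those of the actual code.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 25))
# Only columns 0 and 3 carry signal; the rest are noise.
y = 2.0 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=100)

# Recursively drop the feature with the smallest coefficient magnitude
# until only 10 covariates remain.
selector = RFE(LinearRegression(), n_features_to_select=10, step=1).fit(X, y)
X_reduced = selector.transform(X)
```

`selector.support_` is a boolean mask over the original columns, so the same reduction can be applied consistently to a held-out test matrix.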

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code from the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the 10-replication results reported in this work; the code uploaded with this thesis contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsections, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated finishing time was about four and a half days on an Intel Dual Core i7.

Due to their architectural design, Domain Adaptation algorithms are a promising avenue for exploring ITE and ATE predictions.

TABLE 4.17: Domain Adaptation Neural Networks

Method                  εITE          εATE          √εPEHE
DANN (Within-sample)    1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and out-of-sample, IHDP 10 replications.

4.3 Discussion

As can be clearly noticed, the results of the machine learning regressors applied in this dissertation are very close to those obtained in the work published by the cited and compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the treatment imbalance of the dataset) nor any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.


The amount of effort invested by those authors seems excessive, leading to complicated methods that do not gain much in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Network on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate custom techniques that constitute the state-of-the-art performance results.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation shows results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, although it shifted once the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal inference problems from observational data, involving more than two possible outcomes, that are substantially better suited to Reinforcement Learning algorithms than to any Deep Neural Network or regressor.

Finally, this work is intended to fill a considerable gap in straightforward definitions for applying machine learning to causality. Although several notable papers were published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.

30 Chapter 5 Conclusions

5.2 Future work

Future directions for this work include several approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to multi-valued treatments. To the best of my knowledge, it should not be costly to make this modification in the code; however, at least one new dataset supporting this kind of treatment would need to be processed.

Third, applying these methods to binary factual predictions, with a Policy Risk threshold determining whether a treatment should be applied, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the 1000-replication IHDP dataset is a very promising task, due to both the architectural design of the algorithm and the state-of-the-art precision it showed in the experiment run here.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied falls within continuous-space time series problems. Such datasets will possibly be the next focus of researchers working on machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf.

Arjas, Elja and Jan Parner. "Causal Reasoning from Longitudinal Data". DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822.

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5.

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353-60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149.

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf.

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399-424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162.

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x.

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation.

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044.

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06). Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094.

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207-3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf.

32 Bibliography

Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060.

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266-298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815.

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci.

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf.

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601.

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849-866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217-240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596-1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '16). Pisa, Italy: ACM, pp. 1199-1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803.

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www.


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

— (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

— (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523-539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331-339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464-1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247-248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.nature.com/articles/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1-102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478-494. ISSN: 1939-1463. DOI: 10.1037/a0029373. URL: http://www.ncbi.nlm.nih.gov/pubmed/23025434.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89-101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393-1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1-17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41-55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

— (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34-58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

— (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322-331. DOI: 10.1198/016214504000001880. URL: https://doi.org/10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for "Estimating individual treatment effect: generalization bounds and algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (complete draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

— (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517-1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147-2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

— (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.

• Declaration of Authorship
• Abstract
• Acknowledgements
• Introduction
  – Motivation
  – Purpose and Research Question
  – Approach and Methodology
  – Scope and Limitation
• Background
  – Rubin-Newman Causal Model
    ∗ The fundamental problem of causal analysis
    ∗ Metrics for Causality
    ∗ Assumptions
    ∗ Definitions
  – Related Work
  – Machine Learning
    ∗ Ordinary Least Squares (Linear Regression)
    ∗ Ridge Regression
    ∗ Support Vector Regressor
    ∗ Bayesian Ridge
    ∗ Lasso
    ∗ Lasso Lars
    ∗ ARD Regression
    ∗ Passive Aggressive Regressor
    ∗ Theil Sen Regressor
    ∗ K-Neighbors Regressor
    ∗ Logistic Regression
• Methodology
  – Dataset
  – IHDP dataset
  – Other articles metrics
• Experiments
  – Machine learning methods applied to IHDP dataset
  – Other experiments
    ∗ Recursive Feature Elimination
    ∗ Domain Adaptation Neural Networks
  – Discussion
• Conclusions
  – Concluding Remarks
  – Future work
• Bibliography


Chapter 4

Experiments

A series of runs over replications of the IHDP dataset was performed to ultimately predict the factual yF and counterfactual yCF outcomes for every single unit. Those values are then input into the code produced by (Louizos et al., 2017), in which the εITE, εATE, and εPEHE errors are calculated to evaluate the performance of the applied machine learning methods. Of particular interest in this work is correctly predicting the Individual Treatment Effect (ITE), which amounts to identifying the best possible action or treatment for a given unit x with its unique covariates (features).

This is a challenging goal since, as should be clear by this point, neither the counterfactual outcome yCF nor the true (noiseless) mean outcomes mu0 and mu1 can be used at all to train the regressor models. Instead, these three values, along with the factual outcome yF, are used only to compute the εITE, εATE and εPEHE errors.
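As a concrete illustration, these errors can be computed from the predicted potential outcomes and the true means mu0, mu1 that the semi-synthetic IHDP data provides. The sketch below uses one common convention for εPEHE and εATE; the thesis relies on the evaluation code of (Louizos et al., 2017), whose exact definitions may differ slightly.

```python
import numpy as np

def evaluate(y0_hat, y1_hat, mu0, mu1):
    """Evaluation-only metrics: mu0/mu1 are never used for training.

    eps_PEHE is the RMSE of the estimated individual effects;
    eps_ATE is the absolute error of the average effect.
    """
    ite_hat = y1_hat - y0_hat   # estimated individual treatment effects
    ite_true = mu1 - mu0        # true effects, known only in synthetic data
    eps_pehe = np.sqrt(np.mean((ite_hat - ite_true) ** 2))
    eps_ate = np.abs(np.mean(ite_hat) - np.mean(ite_true))
    return eps_pehe, eps_ate

mu0, mu1 = np.zeros(4), np.full(4, 4.0)
# Perfect predictions give zero error on both metrics
pehe, ate = evaluate(mu0, mu1, mu0, mu1)
```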

The experiments were run on 10, 100 and 1000 replications, both within-sample and out-of-sample. The 10-, 100- and 1000-replication sets were downloaded from (Johansson, 2017 (accessed July 19, 2018)) and are the same ones used to produce the results in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017), whose state-of-the-art errors are also displayed in this section so the reader can compare them with the outcomes of this work.

It is important to clarify and define here what within-sample and out-of-sample stand for. The definitions given by (Shalit, Johansson, and Sontag, 2017) are used, this being the same convention later followed by (Louizos et al., 2017) to perform, compare and present their results.

Within-sample: these are the errors (ITE, ATE and PEHE) made by the predictions of the already trained model on the training and validation (if any) datasets. Note that this is not a trivial task, since the model has been trained on an unbalanced dataset (different numbers of samples received treatments t = 0 and t = 1), in which only one applied treatment and the factual outcome of that treatment are known for each individual x ∈ X. The other problem to overcome is that, in practice, the population that received treatment t = 1 and the population that received t = 0 might come from completely different probability distributions. All of these are common problems of observational data, and they were mentioned in the previous chapters.


Out-of-sample: these predictions are made on completely new units, unseen during the training and validation phases. In this case it is naturally harder to make predictions, since the inputs might come from probability distributions different from those of the (already potentially unbalanced) training phase. The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, and the ITE, ATE and PEHE errors are then computed.

Once a model is trained, it predicts for each one of the inputs (units) with the treatment value set to t = 0, and then again with all treatment values set to t = 1. The difference between these two predictions for each input is the estimated ITE, and it ultimately determines whether the patient would benefit from the treatment. Mathematically, it is represented by E[Y1 − Y0 | x].

The machine learning algorithms, as implemented in Python by (User guide: contents — scikit-learn 0.19.2 documentation), were run with their default hyperparameters to obtain the above-mentioned metrics shown in the tables in this chapter. Hyperparameter tuning was done on 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
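The two-prediction procedure can be sketched as follows. The data here is a toy stand-in for IHDP: the covariate count matches, but the outcome model and its constant effect of 2.0 are illustrative assumptions, not the thesis code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 25))        # 25 covariates per unit, as in IHDP
t = rng.randint(0, 2, size=500)       # observed binary treatment
y_f = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=500)  # factual outcome only

# Train one regressor on [covariates, treatment] against the factual outcome
model = LinearRegression().fit(np.column_stack([X, t]), y_f)

# Predict both potential outcomes by forcing the treatment column to 0 and to 1
y0_hat = model.predict(np.column_stack([X, np.zeros(len(X))]))
y1_hat = model.predict(np.column_stack([X, np.ones(len(X))]))

ite_hat = y1_hat - y0_hat             # estimate of E[Y1 - Y0 | x]
```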

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample runs are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for the same algorithms on the same 10 replications.
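The suite of scikit-learn regressors in the tables can be run with default hyperparameters along these lines (a sketch on synthetic data, not the exact thesis script):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import (ARDRegression, BayesianRidge, Lasso, LassoLars,
                                  LinearRegression, PassiveAggressiveRegressor,
                                  TheilSenRegressor)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor

regressors = {
    "SVR": SVR(),
    "BayesianRidge": BayesianRidge(),
    "LassoLars": LassoLars(),
    "Lasso": Lasso(),
    "ARDRegression": ARDRegression(),
    "PassiveAggressiveRegressor": PassiveAggressiveRegressor(),
    "TheilSenRegressor": TheilSenRegressor(),
    "BaggingRegressor": BaggingRegressor(),
    "KNeighborsRegressor": KNeighborsRegressor(),
    "LinearRegression": LinearRegression(),
}

rng = np.random.RandomState(0)
Xt = rng.normal(size=(150, 11))       # toy stand-in: covariates plus a treatment column
y = Xt @ rng.normal(size=11)

predictions = {}
for name, reg in regressors.items():
    reg.fit(Xt, y)                    # default hyperparameters throughout
    predictions[name] = reg.predict(Xt)
```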

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14    0.94 ± 0.35    2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99    0.97 ± 0.67    4.80 ± 2.80
LassoLars                         4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
Lasso                             4.76 ± 1.25    4.67 ± 0.57    7.40 ± 2.55
ARDRegression                     3.92 ± 2.01    0.97 ± 0.74    4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09    1.54 ± 1.07    4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99    0.89 ± 0.63    4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
KNeighborsRegressor               5.14 ± 1.67    3.57 ± 0.47    6.27 ± 2.31
LinearRegression                  3.92 ± 2.01    0.89 ± 0.65    4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case, it is worth noting that the split between training and testing was performed randomly, over the training dataset only. The intention was to test whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                         4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                             4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                     3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighborsRegressor               4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                  3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                         4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                             4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                     4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighborsRegressor               5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                  4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                         4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                             4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                     4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighborsRegressor               4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                  4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the pre-split training and test sets obtained from (Johansson, 2017 (accessed July 19, 2018)), which are exactly the same datasets used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned in this thesis, hyperparameter tuning, where performed, used this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                         4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                             4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                     4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighborsRegressor               5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                  4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                         4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                             4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                     4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighborsRegressor               4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                  4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting A generated using the code from (Dorie, 2016), was used for both types of measures.

Four different tables are shown for the 1000 replications. Two of them (Tables 4.7, 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler() from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results presented in the following section.
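The scaling step can be reproduced as below. Note that the scaler is fitted on the training covariates only and then applied to unseen data with the same statistics (a small sketch with made-up values, not the thesis code):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0, 200.0],
                    [2.0, 400.0],
                    [3.0, 600.0]])
X_test = np.array([[2.5, 500.0]])

scaler = MinMaxScaler()                  # maps each covariate to [0, 1]
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)      # reuses the training min/max
```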



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighborsRegressor               5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighborsRegressor               4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                         4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                             4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                     4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighborsRegressor               4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                  4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

Method                            εITE           εATE           √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                         4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                             4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                     4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighborsRegressor               4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                  4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Consequently, Logistic Regression with a multinomial multi-class predictor has also been applied. Its performance is well below the regressors', the main reason being that when the continuous target values are encoded into classes to assign them probabilities, these are not the same values that need to be predicted; precision is also lost when decoding the predictions. The ℓ2 penalty has been used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
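A sketch of this encode-predict-decode scheme follows; the equal-width binning used here is an illustrative assumption, not necessarily the encoding used in the thesis code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(400, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=400)       # continuous outcome

# Encode the continuous target into discrete classes; precision is lost here
edges = np.linspace(y.min(), y.max(), 21)           # 20 equal-width bins
y_cls = np.clip(np.digitize(y, edges) - 1, 0, 19)   # class label per unit

clf = LogisticRegression(solver="newton-cg", max_iter=1000)  # lbfgs also works
clf.fit(X, y_cls)

# Decode predicted classes back to bin centres; more precision is lost here
centres = (edges[:-1] + edges[1:]) / 2.0
y_hat = centres[clf.predict(X)]
```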

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

Method                                  εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)     7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)         7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                                  εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)     5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)         5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85


From all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were done. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (RBF) kernel, C = 1e3 and gamma = 0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyper-parameter selection runs.
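The SVR configurations compared in these tables can be generated along these lines (a sketch on toy data; the selection criterion in the thesis is the causal error metrics, for which in-sample RMSE stands in here):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 25))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# RBF kernels over a gamma grid, plus polynomial kernels over a degree grid
grid = [SVR(kernel="rbf", C=1e3, gamma=g)
        for g in (0.1, 0.05, 0.01, 1e-3, 1e-4, 1e-5)]
grid += [SVR(kernel="poly", C=1e3, degree=d) for d in (1, 2, 4)]

scores = {}
for model in grid:
    model.fit(X, y)
    scores[repr(model)] = np.sqrt(np.mean((model.predict(X) - y) ** 2))

# Final choice reported in the thesis: RBF kernel, C=1e3, gamma=0.01
best = SVR(kernel="rbf", C=1e3, gamma=0.01).fit(X, y)
```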

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

Configuration               εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1            3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05           2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01           2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001          3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001         4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001        4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1           3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1           3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1           3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2        2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1        2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4        2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2       2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

Configuration               εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1            2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05           2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01           2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001          3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001         4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001        4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1           2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1           2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1           2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2        2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1        2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4        2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2       3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39

Finally, the results obtained by the experiments run in this thesis are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method        √εPEHE       εATE
OLS/LR-1      5.8 ± .3     .73 ± .04
OLS/LR-2      2.4 ± .1     .14 ± .01
BLR           5.8 ± .3     .72 ± .04
k-NN          2.1 ± .1     .14 ± .01
TMLE          5.0 ± .2     .30 ± .01
BART          2.1 ± .1     .23 ± .01
RANDFOR       4.2 ± .2     .73 ± .05
CAUSFOR       3.8 ± .2     .18 ± .01
BNN           2.2 ± .1     .37 ± .03
TARNET        .88 ± .0     .26 ± .01
CFR MMD       .73 ± .0     .30 ± .01
CFR WASS      .71 ± .0     .25 ± .01

Within-sample, IHDP 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method        √εPEHE       εATE
OLS/LR-1      5.8 ± .3     .94 ± .06
OLS/LR-2      2.5 ± .1     .31 ± .02
BLR           5.8 ± .3     .93 ± .05
k-NN          4.1 ± .2     .79 ± .05
BART          2.3 ± .1     .34 ± .02
RANDFOR       6.6 ± .3     .96 ± .06
CAUSFOR       3.8 ± .2     .40 ± .03
BNN           2.1 ± .1     .42 ± .03
TARNET        .95 ± .0     .28 ± .01
CFR MMD       .78 ± .0     .31 ± .01
CFR WASS      .76 ± .0     .27 ± .01

Out-of-sample, IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though powerful feature-selection publications were presented in the Related Work section, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn library.

In addition, given the Strong Ignorability assumption made for the studied dataset, it would not strictly be appropriate to perform such an experiment; but given the sensitivity of machine learning regressors to highly correlated input features, it might relieve some of their errors.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; thus they are not shown, but they can be reviewed in the code implementation for further analysis.
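The RFE experiment can be reproduced in outline with scikit-learn (a sketch: the base estimator, the synthetic data, and the number of retained features are illustrative choices, not the thesis configuration):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Toy stand-in for the IHDP covariates: 25 features, only 5 informative
X, y = make_regression(n_samples=200, n_features=25, n_informative=5,
                       noise=0.1, random_state=0)

# Recursively drop the weakest feature until 10 remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=10)
selector.fit(X, y)

X_reduced = selector.transform(X)   # the regressors are then retrained on this
```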

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications; the code uploaded with this work contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel Dual Core i7.

Domain Adaptation algorithms are a promising direction for ITE and ATE prediction due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                  εITE           εATE           √εPEHE
DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and out-of-sample, IHDP 10 replications

4.3 Discussion

As can be clearly noticed, the machine learning regressor algorithms applied in this dissertation perform very close to the results published by the cited and compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric) to overcome the unbalanced treatment assignment, nor any other custom loss function, was applied to obtain the results shown in Table 4.10 and Table 4.9.


It seems that an excessive amount of effort from the authors leads to complicated methods that gain little in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the Domain Adaptation Neural Network training and testing errors on the 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is part of the must-do tasks when using machine learning algorithms) were performed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation presents results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then reporting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, and it evolved once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years are potentially more suitable than both plain regressors and custom or generalized metric and error functions; Domain Adaptation Neural Networks, as well as other methods from the Deep Learning literature, are examples. Moreover, continuous-space causal problems from observational data, with more than two possible treatments, are substantially better suited to Reinforcement Learning algorithms than to any plain Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply the studied machine learning methods to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment variable would need to be processed.

Third, applying this method to binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the precision, outperforming the state of the art, that it showed in the experiment run here.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied falls within continuous-space time-series problems. Such datasets will possibly be the next focus of researchers studying machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf


Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer Patrik O et al Nonlinear causal discovery with additive noise models Tech repURL https is tuebingen mpg de fileadmin user _ upload files publicationsNIPS2008-Hoyer-neu_5406[0]pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww


Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for "Estimating individual treatment effect: generalization bounds and algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". In: URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Complete Draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: https://www.ncbi.nlm.nih.gov/pubmed/25729117.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents – scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.


Chapter 4

Experiments

Out-of-sample: these predictions are made on completely unseen units, held out from both the training and validation phases. Predictions are naturally harder in this case, since the inputs may come from a probability distribution different from that of the training phase (which is already potentially unbalanced). The experimental procedure is the same: predictions for t = 0 and t = 1 are made for each single unit (input) of the testing dataset, and the ITE, ATE and PEHE errors are then computed.

Once the model is trained, it predicts the outcome for each one of the inputs (units) with the treatment value set to t = 0; the predictions are then repeated with all treatment values set to t = 1. The difference between these two predictions for each input is the ITE, which ultimately determines whether the patient would benefit from the treatment; mathematically, it is E[Y1 − Y0 | x]. The machine learning algorithms, implemented in Python with scikit-learn (User guide: contents – scikit-learn 0.19.2 documentation), were run with their default hyperparameters to obtain the metrics shown in the tables in this chapter. Hyperparameter tuning was done on 100 replications, following the same methodology as the compared methods in the previously mentioned publications.
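To make the procedure concrete, the following sketch reproduces it with scikit-learn on synthetic stand-in data (the array names and the toy response surface are illustrative, not the IHDP specification; the true potential outcomes are known here only because the data are simulated, exactly as in the semi-synthetic benchmark):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)

# Toy stand-in for one replication: 25 covariates, a binary treatment t and
# two noiseless potential outcomes mu0/mu1 used only to score predictions.
n = 500
X = rng.normal(size=(n, 25))
t = rng.binomial(1, 0.3, size=n)
mu0 = 1.5 * X[:, 0]
mu1 = mu0 + 4.0                      # constant true effect of 4.0
y_factual = np.where(t == 1, mu1, mu0) + rng.normal(scale=0.1, size=n)

# One regressor fitted on [covariates, treatment] -> factual outcome.
model = LinearRegression().fit(np.c_[X, t], y_factual)

# Predict both potential outcomes by forcing t = 0 and then t = 1.
y0_hat = model.predict(np.c_[X, np.zeros(n)])
y1_hat = model.predict(np.c_[X, np.ones(n)])

ite_hat = y1_hat - y0_hat            # estimate of E[Y1 - Y0 | x] per unit
true_ite = mu1 - mu0

eps_ate = abs(ite_hat.mean() - true_ite.mean())          # ATE error
eps_pehe = np.sqrt(np.mean((ite_hat - true_ite) ** 2))   # sqrt(PEHE)
```

The same two forced predictions on a held-out split give the out-of-sample variants of the metrics.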

4.1 Machine learning methods applied to IHDP dataset

First, traditional off-the-shelf machine learning methods were applied to the 10-replication IHDP dataset.

Their medians and variances across the 10 replications for the within-sample run are displayed in Table 4.1, whereas Table 4.2 shows the out-of-sample errors (the lower the better) for 10 replications of the IHDP dataset with the same algorithms.

TABLE 4.1: IHDP 10 replications with traditional machine learning algorithms - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.62 ± 1.14   0.94 ± 0.35   2.73 ± 1.23
BayesianRidge                     3.90 ± 1.99   0.97 ± 0.67   4.80 ± 2.80
LassoLars                         4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
Lasso                             4.76 ± 1.25   4.67 ± 0.57   7.40 ± 2.55
ARDRegression                     3.92 ± 2.01   0.97 ± 0.74   4.80 ± 2.81
PassiveAggressiveRegressor        4.39 ± 2.09   1.54 ± 1.07   4.97 ± 2.92
TheilSenRegressor                 3.93 ± 1.99   0.89 ± 0.63   4.78 ± 2.79
BaggingRegressor                  5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
KNeighboursRegressor              5.14 ± 1.67   3.57 ± 0.47   6.27 ± 2.31
LinearRegression                  3.92 ± 2.01   0.89 ± 0.65   4.79 ± 2.79

In the next experiment, Tables 4.3 and 4.4 show 100 replications of the IHDP dataset, within-sample and out-of-sample respectively. In this case it is worth noting that the split between training and testing was performed randomly, over the training dataset only. The intention was to test whether the results


TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.26 ± 0.63   1.24 ± 0.62   2.34 ± 0.98
BayesianRidge                     3.54 ± 1.67   1.82 ± 1.35   4.13 ± 2.23
LassoLars                         4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
Lasso                             4.30 ± 0.89   5.48 ± 1.25   6.95 ± 2.06
ARDRegression                     3.57 ± 1.70   1.83 ± 1.41   4.14 ± 2.27
PassiveAggressiveRegressor        4.19 ± 1.94   2.38 ± 1.75   4.45 ± 2.49
TheilSenRegressor                 3.62 ± 1.68   1.76 ± 1.30   4.08 ± 2.21
BaggingRegressor                  4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
KNeighboursRegressor              4.27 ± 1.18   3.92 ± 0.95   5.63 ± 1.81
LinearRegression                  3.58 ± 1.70   1.77 ± 1.33   4.09 ± 2.22

obtained for the following experiments, with the datasets already split into train and test downloaded from (Johansson, 2017 (accessed July 19, 2018)), are similar or differ dramatically.

TABLE 4.3: IHDP 100 replications - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
BayesianRidge                     4.49 ± 0.60   0.86 ± 0.16   5.65 ± 0.83
LassoLars                         4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
Lasso                             4.76 ± 0.36   4.57 ± 0.17   7.90 ± 0.77
ARDRegression                     4.49 ± 0.60   0.81 ± 0.16   5.64 ± 0.83
PassiveAggressiveRegressor        5.49 ± 0.75   0.83 ± 0.14   5.66 ± 0.83
TheilSenRegressor                 4.45 ± 0.59   0.79 ± 0.15   5.63 ± 0.83
BaggingRegressor                  5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
KNeighboursRegressor              5.35 ± 0.49   3.46 ± 0.14   6.78 ± 0.70
LinearRegression                  4.53 ± 0.60   0.79 ± 0.16   5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
BayesianRidge                     4.27 ± 0.54   1.02 ± 0.26   5.37 ± 0.78
LassoLars                         4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
Lasso                             4.75 ± 0.39   4.51 ± 0.23   7.57 ± 0.71
ARDRegression                     4.27 ± 0.54   1.00 ± 0.26   5.36 ± 0.78
PassiveAggressiveRegressor        5.28 ± 0.69   1.00 ± 0.21   5.36 ± 0.77
TheilSenRegressor                 4.24 ± 0.53   0.99 ± 0.25   5.35 ± 0.78
BaggingRegressor                  4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
KNeighboursRegressor              4.93 ± 0.43   3.19 ± 0.18   6.23 ± 0.63
LinearRegression                  4.31 ± 0.55   0.99 ± 0.26   5.36 ± 0.79


Consequently, Tables 4.5 and 4.6 show the results for the dataset already split into training and test sets, obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned earlier in this thesis, any hyperparameter tuning was performed on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.05 ± 0.38   0.76 ± 0.08   3.17 ± 0.40
BayesianRidge                     4.44 ± 0.58   0.80 ± 0.14   5.61 ± 0.83
LassoLars                         4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
Lasso                             4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
ARDRegression                     4.45 ± 0.59   0.77 ± 0.15   5.61 ± 0.83
PassiveAggressiveRegressor        5.03 ± 0.62   0.83 ± 0.13   5.63 ± 0.82
TheilSenRegressor                 4.40 ± 0.57   0.72 ± 0.13   5.60 ± 0.82
BaggingRegressor                  5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
KNeighboursRegressor              5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
LinearRegression                  4.48 ± 0.59   0.75 ± 0.14   5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.84 ± 0.28   0.73 ± 0.07   3.49 ± 0.49
BayesianRidge                     4.41 ± 0.57   0.81 ± 0.11   5.74 ± 0.89
LassoLars                         4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
Lasso                             4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
ARDRegression                     4.42 ± 0.58   0.78 ± 0.11   5.73 ± 0.89
PassiveAggressiveRegressor        4.95 ± 0.59   1.01 ± 0.17   5.78 ± 0.89
TheilSenRegressor                 4.38 ± 0.56   0.85 ± 0.13   5.74 ± 0.89
BaggingRegressor                  4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
KNeighboursRegressor              4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
LinearRegression                  4.45 ± 0.58   0.78 ± 0.11   5.73 ± 0.89

With 1000 replications, both the within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting A generated using the code from (Dorie, 2016), was used for both types of measures.

Four different tables are presented for the 1000 replications, in pairs: two of them (Tables 4.7 and 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] using the MinMaxScaler from the scikit-learn library. The results improved, not significantly, but enough to keep the scaling for the final results of the methods presented in the following section.
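The scaling step can be sketched as follows (toy arrays standing in for the IHDP covariate splits). Note that the scaler is fitted on the training split only, so out-of-sample units are mapped using the training minima and maxima:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy covariate matrices standing in for the train/test splits.
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])
X_test = np.array([[2.5, 15.0]])

scaler = MinMaxScaler()                      # maps each covariate to [0, 1]
X_train_s = scaler.fit_transform(X_train)    # fit on the training data only
X_test_s = scaler.transform(X_test)          # reuse the training min/max
```

Fitting on the training split alone mirrors the within-sample versus out-of-sample protocol: test units can legitimately fall outside [0, 1] if they lie outside the training range.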



TABLE 4.7: IHDP 1000 replications - No scaling - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    3.09 ± 0.12   0.69 ± 0.02   3.21 ± 0.13
BayesianRidge                     4.59 ± 0.19   0.78 ± 0.04   5.81 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.41 ± 0.22   0.90 ± 0.05   5.85 ± 0.26
TheilSenRegressor                 4.55 ± 0.18   0.70 ± 0.03   5.79 ± 0.26
BaggingRegressor                  5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
KNeighboursRegressor              5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.81 ± 0.09   0.78 ± 0.03   3.37 ± 0.14
BayesianRidge                     4.57 ± 0.19   0.98 ± 0.05   5.79 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.42 ± 0.22   1.13 ± 0.07   5.83 ± 0.27
TheilSenRegressor                 4.54 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
KNeighboursRegressor              4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.38 ± 0.08   0.33 ± 0.02   2.77 ± 0.12
BayesianRidge                     4.58 ± 0.19   0.73 ± 0.04   5.80 ± 0.26
LassoLars                         4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                             4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                     4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor        5.47 ± 0.22   1.02 ± 0.06   5.88 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.69 ± 0.03   5.79 ± 0.26
BaggingRegressor                  4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
KNeighboursRegressor              4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
RANSACRegressor                   4.93 ± 0.20   1.64 ± 0.09   6.09 ± 0.26
HuberRegressor                    4.44 ± 0.18   0.67 ± 0.03   5.79 ± 0.25
ElasticNet                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
LinearRegression                  4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                  εITE          εATE          √εPEHE
Support Vector Regressor (SVR)    2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                     4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                         4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                             4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                     4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor        5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor                 4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                  4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighboursRegressor              4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                   4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                    4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                  4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Consequently, Logistic Regression with a multinomial multi-class predictor was applied. Its performance is well below that of the regressors, the main reason being that when the target values are encoded so that a probability can be assigned to each class, the encoded classes are not the same values that need to be predicted; further precision is lost when decoding the predictions. The l2 norm was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

                                       εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)    7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)        7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                       εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)    5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)        5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
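One way to reproduce the encode/decode scheme described above is sketched below (an illustrative reconstruction, not necessarily the exact binning used in the experiments): the continuous outcome is discretized into quantile bins, a multinomial logistic regression classifies the bins, and each prediction is decoded back to its bin centre, which is exactly where precision is lost:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)   # continuous outcome

# Encode: discretize y into quantile bins so it can act as a class label.
n_bins = 10
edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1))
labels = np.digitize(y, edges[1:-1])                  # classes 0 .. n_bins-1

# newton-cg (or lbfgs) with the default l2 penalty fits a multinomial model.
clf = LogisticRegression(solver="newton-cg", max_iter=500).fit(X, labels)

# Decode: map each predicted class back to its bin centre; snapping to the
# centre is where the regression precision is lost.
centres = (edges[:-1] + edges[1:]) / 2.0
y_hat = centres[clf.predict(X)]
```

Even a perfect classifier can do no better than the bin width, which explains why the errors in Tables 4.11 and 4.12 sit well above those of the regressors.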


From all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were performed. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were: Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyper-parameter selection.

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

                          εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

                          εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39
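The grid explored in these tables can be reproduced with a loop of the following shape (toy data and cross-validated mean-squared-error scoring stand in for the per-replication εITE, εATE and √εPEHE computations):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# The candidate grid from the tables: rbf kernels over gamma and poly
# kernels over degree, all at C = 1e3.
candidates = [SVR(kernel="rbf", C=1e3, gamma=g)
              for g in (0.1, 0.05, 0.01, 0.001)]
candidates += [SVR(kernel="poly", C=1e3, degree=d) for d in (1, 2, 4)]

results = []
for model in candidates:
    mse = -cross_val_score(model, X, y, cv=3,
                           scoring="neg_mean_squared_error").mean()
    results.append((mse, model.get_params()))

best_mse, best_params = min(results, key=lambda r: r[0])
```

The real selection scores each candidate on the causal error metrics across the 100 replications rather than on a single regression loss, but the loop structure is the same.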

Finally, the results obtained by the experiments run for this thesis are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results obtained in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .73 ± .04
OLS/LR-2    2.4 ± .1    .14 ± .01
BLR         5.8 ± .3    .72 ± .04
k-NN        2.1 ± .1    .14 ± .01
TMLE        5.0 ± .2    .30 ± .01
BART        2.1 ± .1    .23 ± .01
RANDFOR     4.2 ± .2    .73 ± .05
CAUSFOR     3.8 ± .2    .18 ± .01
BNN         2.2 ± .1    .37 ± .03
TARNET      .88 ± .0    .26 ± .01
CFR MMD     .73 ± .0    .30 ± .01
CFR WASS    .71 ± .0    .25 ± .01

Within-sample, IHDP, 1000 replications.

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE      εATE
OLS/LR-1    5.8 ± .3    .94 ± .06
OLS/LR-2    2.5 ± .1    .31 ± .02
BLR         5.8 ± .3    .93 ± .05
k-NN        4.1 ± .2    .79 ± .05
BART        2.3 ± .1    .34 ± .02
RANDFOR     6.6 ± .3    .96 ± .06
CAUSFOR     3.8 ± .2    .40 ± .03
BNN         2.1 ± .1    .42 ± .03
TARNET      .95 ± .0    .28 ± .01
CFR MMD     .78 ± .0    .31 ± .01
CFR WASS    .76 ± .0    .27 ± .01

Out-of-sample, IHDP, 1000 replications.


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications on powerful feature selection methods were presented in the Related Work section, an experiment was performed in the developed code that applies machine learning Recursive Feature Elimination (RFE) using the scikit-learn framework.

In addition, assuming Strong Ignorability on the studied dataset, it would not strictly be appropriate to perform such an experiment; but given the sensitivity of the machine learning regressors to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; they are therefore not shown, but can be reviewed by the reader in the code implementation for further analysis.
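A minimal sketch of this kind of experiment follows (hypothetical toy data with one deliberately near-collinear covariate; the thesis code applies the same selector to the IHDP covariates):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
X[:, 5] = X[:, 0] + 0.01 * rng.normal(size=200)   # near-duplicate covariate
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Recursively drop the weakest covariates until 5 remain.
selector = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
X_reduced = selector.transform(X)
```

Dropping covariates in this way conflicts with Strong Ignorability: a removed feature may be exactly the confounder that makes the treatment assignment ignorable, which is one explanation for the degraded causal metrics.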

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results, shown below in Table 4.17, are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel Dual Core i7.

Domain adaptation algorithms are a promising field to explore for ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

                        εITE          εATE          √εPEHE
DANN (Within-sample)    1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and out-of-sample, IHDP, 10 replications.

4.3 Discussion

As can clearly be seen, the machine learning regressors applied in this dissertation achieve results very close to those published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treated dataset, nor any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.


The considerable effort invested by those authors appears to lead to complicated methods without gaining much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the 10-replication Domain Adaptation Neural Network training and testing errors showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except scaling the features, which is part of the must-do tasks when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, though it shifted once the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to notice that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the deep learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, continuous-space causal problems from observational data, involving more than two possible outcomes, are substantially more suitable to solve with reinforcement learning algorithms than with any deep neural network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include several different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to a multi-valued treatment. To the best of my knowledge, it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment arity would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied, would be an important next step regarding causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing with this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due to both the architectural design of the algorithm and the precision, outperforming the state of the art, that it showed in the experiment run.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied is framed within time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers in machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela Van Der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf.

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822.

Atan, Onur, William R. Zame, and Mihaela Van Der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5.

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: https://www.ncbi.nlm.nih.gov/pubmed/27382149.

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf.

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: https://www.ncbi.nlm.nih.gov/pubmed/21818162.

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". In: DOI: 10.1111/j.1541-0420.2005.00377.x.

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation.

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044.

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094.

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf.


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060.

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815.

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci.

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf.

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601.

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803.

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www/.


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". In: (). URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". In: URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Complete Draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.

  • Declaration of Authorship
  • Abstract
  • Acknowledgements
  • Introduction
    • Motivation
    • Purpose and Research Question
    • Approach and Methodology
    • Scope and Limitation
  • Background
    • Rubin-Neyman Causal Model
      • The fundamental problem of causal analysis
      • Metrics for Causality
      • Assumptions
      • Definitions
    • Related Work
  • Machine Learning
    • Ordinary Least Squares (Linear Regression)
    • Ridge Regression
    • Support Vector Regressor
    • Bayesian Ridge
    • Lasso
    • Lasso Lars
    • ARD Regression
    • Passive Aggressive Regressor
    • Theil Sen Regressor
    • K-Neighbors Regressor
    • Logistic Regression
  • Methodology
    • Dataset
    • IHDP dataset
    • Other articles' metrics
  • Experiments
    • Machine learning methods applied to IHDP dataset
    • Other experiments
      • Recursive Feature Elimination
      • Domain Adaptation Neural Networks
    • Discussion
  • Conclusions
    • Concluding Remarks
    • Future work
  • Bibliography

4.1 Machine learning methods applied to IHDP dataset

TABLE 4.2: IHDP 10 replications with traditional machine learning algorithms - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.26 ± 0.63    1.24 ± 0.62    2.34 ± 0.98
BayesianRidge                    3.54 ± 1.67    1.82 ± 1.35    4.13 ± 2.23
LassoLars                        4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
Lasso                            4.30 ± 0.89    5.48 ± 1.25    6.95 ± 2.06
ARDRegression                    3.57 ± 1.70    1.83 ± 1.41    4.14 ± 2.27
PassiveAggressiveRegressor       4.19 ± 1.94    2.38 ± 1.75    4.45 ± 2.49
TheilSenRegressor                3.62 ± 1.68    1.76 ± 1.30    4.08 ± 2.21
BaggingRegressor                 4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
KNeighborsRegressor              4.27 ± 1.18    3.92 ± 0.95    5.63 ± 1.81
LinearRegression                 3.58 ± 1.70    1.77 ± 1.33    4.09 ± 2.22

The purpose is to check whether the results obtained for the following experiments, which use the datasets already split into train and test downloaded from (Johansson, 2017), are similar or differ dramatically.
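For reference, the three error columns reported throughout these tables can be computed from predicted and true potential outcomes. The sketch below assumes the usual definitions for this benchmark (absolute ATE error and root PEHE); taking εITE as the mean absolute error of the individual effects is an assumption here, and the helper name is hypothetical. IHDP is semi-synthetic, so the true (noiseless) potential outcomes are available for evaluation.

```python
import numpy as np

def causal_error_metrics(mu0_hat, mu1_hat, mu0, mu1):
    """Evaluate estimated potential outcomes against the true ones.

    mu0_hat/mu1_hat: model predictions for Y(0) and Y(1).
    mu0/mu1: ground-truth (noiseless) potential outcomes, known
    because IHDP is semi-synthetic.
    """
    tau_hat = mu1_hat - mu0_hat   # estimated individual treatment effects
    tau = mu1 - mu0               # true individual treatment effects
    eps_ite = float(np.mean(np.abs(tau_hat - tau)))            # assumed definition
    eps_ate = float(abs(np.mean(tau_hat) - np.mean(tau)))
    sqrt_pehe = float(np.sqrt(np.mean((tau_hat - tau) ** 2)))
    return eps_ite, eps_ate, sqrt_pehe
```

A perfect model yields zero on all three metrics; a model that is right on average but wrong per-unit can have a small εATE and a large √εPEHE.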

TABLE 4.3: IHDP 100 replications - Within sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
BayesianRidge                    4.49 ± 0.60    0.86 ± 0.16    5.65 ± 0.83
LassoLars                        4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
Lasso                            4.76 ± 0.36    4.57 ± 0.17    7.90 ± 0.77
ARDRegression                    4.49 ± 0.60    0.81 ± 0.16    5.64 ± 0.83
PassiveAggressiveRegressor       5.49 ± 0.75    0.83 ± 0.14    5.66 ± 0.83
TheilSenRegressor                4.45 ± 0.59    0.79 ± 0.15    5.63 ± 0.83
BaggingRegressor                 5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
KNeighborsRegressor              5.35 ± 0.49    3.46 ± 0.14    6.78 ± 0.70
LinearRegression                 4.53 ± 0.60    0.79 ± 0.16    5.63 ± 0.83

TABLE 4.4: IHDP 100 replications - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
BayesianRidge                    4.27 ± 0.54    1.02 ± 0.26    5.37 ± 0.78
LassoLars                        4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
Lasso                            4.75 ± 0.39    4.51 ± 0.23    7.57 ± 0.71
ARDRegression                    4.27 ± 0.54    1.00 ± 0.26    5.36 ± 0.78
PassiveAggressiveRegressor       5.28 ± 0.69    1.00 ± 0.21    5.36 ± 0.77
TheilSenRegressor                4.24 ± 0.53    0.99 ± 0.25    5.35 ± 0.78
BaggingRegressor                 4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
KNeighborsRegressor              4.93 ± 0.43    3.19 ± 0.18    6.23 ± 0.63
LinearRegression                 4.31 ± 0.55    0.99 ± 0.26    5.36 ± 0.79


Consequently, Table 4.5 and Table 4.6 show the results for the pre-split training and test sets obtained from (Johansson, 2017), which correspond to the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson, and Sontag, 2017). As mentioned earlier in this thesis, hyperparameter tuning, where performed, was carried out on this number of replications.

TABLE 4.5: IHDP 100 replications, already split dataset - Within sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   3.05 ± 0.38    0.76 ± 0.08    3.17 ± 0.40
BayesianRidge                    4.44 ± 0.58    0.80 ± 0.14    5.61 ± 0.83
LassoLars                        4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
Lasso                            4.76 ± 0.36    4.55 ± 0.17    7.88 ± 0.76
ARDRegression                    4.45 ± 0.59    0.77 ± 0.15    5.61 ± 0.83
PassiveAggressiveRegressor       5.03 ± 0.62    0.83 ± 0.13    5.63 ± 0.82
TheilSenRegressor                4.40 ± 0.57    0.72 ± 0.13    5.60 ± 0.82
BaggingRegressor                 5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
KNeighborsRegressor              5.31 ± 0.48    3.41 ± 0.14    6.72 ± 0.69
LinearRegression                 4.48 ± 0.59    0.75 ± 0.14    5.60 ± 0.82

TABLE 4.6: IHDP 100 replications, already split dataset - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.84 ± 0.28    0.73 ± 0.07    3.49 ± 0.49
BayesianRidge                    4.41 ± 0.57    0.81 ± 0.11    5.74 ± 0.89
LassoLars                        4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
Lasso                            4.65 ± 0.34    4.31 ± 0.14    7.96 ± 0.82
ARDRegression                    4.42 ± 0.58    0.78 ± 0.11    5.73 ± 0.89
PassiveAggressiveRegressor       4.95 ± 0.59    1.01 ± 0.17    5.78 ± 0.89
TheilSenRegressor                4.38 ± 0.56    0.85 ± 0.13    5.74 ± 0.89
BaggingRegressor                 4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
KNeighborsRegressor              4.95 ± 0.46    2.98 ± 0.10    6.65 ± 0.75
LinearRegression                 4.45 ± 0.58    0.78 ± 0.11    5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson, and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with the log-linear response setting "A", generated using the code from (Dorie, 2016), was used to perform both types of measures.

Four tables are presented for the 1000 replications, in pairs: two of them (Tables 4.7, 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the features to [0, 1] with the MinMaxScaler from the scikit-learn library. The results improved, not significantly but enough to keep the scaling for the final results of the methods presented in the following section.
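The scaling step described above can be sketched as follows. The toy data and the single-model setup (treatment appended as one more input feature, then queried with the treatment forced to 1 and to 0) are illustrative assumptions, not the thesis's exact pipeline; IHDP's real covariates and outcomes would replace them.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 25))                  # stand-in covariates (IHDP has 25)
t = rng.integers(0, 2, size=n)                # binary treatment indicator
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=n)  # toy factual outcome

scaler = MinMaxScaler()                        # maps each covariate to [0, 1]
Xs = scaler.fit_transform(X)

# Treatment appended as an extra feature; the fitted model is then asked
# for both potential outcomes by forcing t to 1 and to 0.
model = BayesianRidge().fit(np.column_stack([Xs, t]), y)
y1_hat = model.predict(np.column_stack([Xs, np.ones(n)]))
y0_hat = model.predict(np.column_stack([Xs, np.zeros(n)]))
tau_hat = y1_hat - y0_hat                      # estimated individual effects
```

With a linear model such as BayesianRidge, `tau_hat` collapses to the learned treatment coefficient; nonlinear regressors can instead express heterogeneous effects.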



TABLE 4.7: IHDP 1000 replications - No scaling - Within sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   3.09 ± 0.12    0.69 ± 0.02    3.21 ± 0.13
BayesianRidge                    4.59 ± 0.19    0.78 ± 0.04    5.81 ± 0.26
LassoLars                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                            4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                    4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor       5.41 ± 0.22    0.90 ± 0.05    5.85 ± 0.26
TheilSenRegressor                4.55 ± 0.18    0.70 ± 0.03    5.79 ± 0.26
BaggingRegressor                 5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
KNeighborsRegressor              5.31 ± 0.15    3.28 ± 0.04    6.76 ± 0.21
LinearRegression                 4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.81 ± 0.09    0.78 ± 0.03    3.37 ± 0.14
BayesianRidge                    4.57 ± 0.19    0.98 ± 0.05    5.79 ± 0.26
LassoLars                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                            4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                    4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor       5.42 ± 0.22    1.13 ± 0.07    5.83 ± 0.27
TheilSenRegressor                4.54 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                 4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
KNeighborsRegressor              4.95 ± 0.14    3.09 ± 0.05    6.54 ± 0.22
LinearRegression                 4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

TABLE 4.9: IHDP 1000 replications - Scaled - Within sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.38 ± 0.08    0.33 ± 0.02    2.77 ± 0.12
BayesianRidge                    4.58 ± 0.19    0.73 ± 0.04    5.80 ± 0.26
LassoLars                        4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
Lasso                            4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
ARDRegression                    4.59 ± 0.19    0.76 ± 0.04    5.80 ± 0.26
PassiveAggressiveRegressor       5.47 ± 0.22    1.02 ± 0.06    5.88 ± 0.26
TheilSenRegressor                4.68 ± 0.19    0.69 ± 0.03    5.79 ± 0.26
BaggingRegressor                 4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
KNeighborsRegressor              4.77 ± 0.13    2.67 ± 0.03    6.37 ± 0.21
RANSACRegressor                  4.93 ± 0.20    1.64 ± 0.09    6.09 ± 0.26
HuberRegressor                   4.44 ± 0.18    0.67 ± 0.03    5.79 ± 0.25
ElasticNet                       4.65 ± 0.10    4.40 ± 0.04    7.91 ± 0.24
LinearRegression                 4.63 ± 0.19    0.73 ± 0.04    5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

Method                           εITE           εATE           √εPEHE
Support Vector Regressor (SVR)   2.44 ± 0.08    0.45 ± 0.03    2.81 ± 0.13
BayesianRidge                    4.55 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
LassoLars                        4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
Lasso                            4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
ARDRegression                    4.58 ± 0.19    0.96 ± 0.05    5.78 ± 0.26
PassiveAggressiveRegressor       5.44 ± 0.22    1.18 ± 0.07    5.87 ± 0.26
TheilSenRegressor                4.68 ± 0.19    0.95 ± 0.05    5.78 ± 0.26
BaggingRegressor                 4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
KNeighborsRegressor              4.46 ± 0.13    2.33 ± 0.04    6.12 ± 0.22
RANSACRegressor                  4.91 ± 0.20    1.73 ± 0.09    6.06 ± 0.27
HuberRegressor                   4.44 ± 0.18    0.92 ± 0.05    5.77 ± 0.26
ElasticNet                       4.66 ± 0.11    4.41 ± 0.05    7.90 ± 0.24
LinearRegression                 4.61 ± 0.19    0.94 ± 0.05    5.78 ± 0.26

Consequently, Logistic Regression with a multinomial multi-class setting was also applied. Its performance is well below the regressors, the main reason being that encoding the continuous target values into classes with assigned probabilities does not match the quantities that need to be predicted, and precision is lost when decoding the predictions. The l2 penalty was used with two different solvers, newton-cg and lbfgs. The results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within sample

Method                                εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)   7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76
LogisticRegression - L2 (lbfgs)       7.77 ± 0.76    4.40 ± 0.17    7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                                εITE           εATE           √εPEHE
LogisticRegression - L2 (newton-cg)   5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
LogisticRegression - L2 (lbfgs)       5.90 ± 0.57    2.41 ± 0.11    7.21 ± 0.85
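The discretize-then-classify baseline discussed above can be sketched as follows. The rounding-based encoding of the continuous outcome is a hypothetical choice for illustration (the thesis does not specify its exact encoding); the lossy round trip through class labels is precisely where the precision loss described in the text comes from.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=300)   # continuous outcome

# Encode the continuous target into discrete class labels (rounding to
# integers here); decoding the predicted class back to a value is coarse.
y_classes = np.round(y)

# Multinomial handling is the default behaviour for the newton-cg solver.
clf = LogisticRegression(solver="newton-cg", penalty="l2", max_iter=1000)
clf.fit(X, y_classes)
y_hat = clf.predict(X)                           # decoded (coarse) predictions
```

Even when the classifier is accurate on the encoded labels, `y_hat` can never be closer to `y` than the bin width allows, which matches the degradation seen in Tables 4.11 and 4.12.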


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few rounds of hyper-parameter tuning were performed. The errors observed became even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel with C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample on 100 replications of the dataset; this is the same procedure the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyper-parameter search, including the final selection.
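A sketch of the selected configuration follows. The synthetic response surface below is invented, standing in for IHDP, and the single-model setup with the treatment as an extra feature is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(2)
n = 300
X = rng.uniform(size=(n, 25))
t = rng.integers(0, 2, size=n)
tau = 1.0 + X[:, 1]                               # heterogeneous true effect
y = np.sin(3.0 * X[:, 0]) + t * tau + rng.normal(scale=0.05, size=n)

Xs = MinMaxScaler().fit_transform(X)

# Final configuration selected in the text: rbf kernel, C=1e3, gamma=0.01.
svr = SVR(kernel="rbf", C=1e3, gamma=0.01)
svr.fit(np.column_stack([Xs, t]), y)

# Query both potential outcomes and average the difference to estimate ATE.
y1_hat = svr.predict(np.column_stack([Xs, np.ones(n)]))
y0_hat = svr.predict(np.column_stack([Xs, np.zeros(n)]))
ate_hat = float(np.mean(y1_hat - y0_hat))
```

A small gamma makes the rbf kernel nearly linear over the [0, 1]-scaled features, which is consistent with the grid results: gamma=0.01 beats both much larger and much smaller values.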

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within sample

Method                    εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

Method                    εITE           εATE           √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39

Finally, the results obtained by the experiments run in this thesis are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results reported in (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - "Estimating individual treatment effect: generalization bounds and algorithms" (Shalit, Johansson, and Sontag, 2017)

Method       √εPEHE      εATE
OLS/LR-1     5.8 ± .3    .73 ± .04
OLS/LR-2     2.4 ± .1    .14 ± .01
BLR          5.8 ± .3    .72 ± .04
k-NN         2.1 ± .1    .14 ± .01
TMLE         5.0 ± .2    .30 ± .01
BART         2.1 ± .1    .23 ± .01
RANDFOR      4.2 ± .2    .73 ± .05
CAUSFOR      3.8 ± .2    .18 ± .01
BNN          2.2 ± .1    .37 ± .03
TARNET       .88 ± .0    .26 ± .01
CFR MMD      .73 ± .0    .30 ± .01
CFR WASS     .71 ± .0    .25 ± .01

Within-sample, IHDP, 1000 replications.

TABLE 4.16: ICML 2017 - "Estimating individual treatment effect: generalization bounds and algorithms" (Shalit, Johansson, and Sontag, 2017)

Method       √εPEHE      εATE
OLS/LR-1     5.8 ± .3    .94 ± .06
OLS/LR-2     2.5 ± .1    .31 ± .02
BLR          5.8 ± .3    .93 ± .05
k-NN         4.1 ± .2    .79 ± .05
BART         2.3 ± .1    .34 ± .02
RANDFOR      6.6 ± .3    .96 ± .06
CAUSFOR      3.8 ± .2    .40 ± .03
BNN          2.1 ± .1    .42 ± .03
TARNET       .95 ± .0    .28 ± .01
CFR MMD      .78 ± .0    .31 ± .01
CFR WASS     .76 ± .0    .27 ± .01

Out-of-sample, IHDP, 1000 replications.


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section presented publications on powerful feature selection methods, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) from the scikit-learn library.

Strictly speaking, under the Strong Ignorability assumption on the studied dataset it would not be appropriate to drop covariates; however, given the sensitivity of the machine learning regressors to highly correlated input features, feature elimination might relieve some of their errors.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; thus they are not shown here, but the reader can reproduce them from the code implementation for further analysis.
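In outline, the RFE experiment looks like the sketch below; the toy data (including a deliberately near-duplicated covariate) and the choice of base estimator are illustrative assumptions, not the thesis's exact setup.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 120
X = rng.normal(size=(n, 10))
X[:, 5] = X[:, 0] + 0.01 * rng.normal(size=n)     # nearly duplicated covariate
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

# RFE refits the estimator repeatedly, dropping the weakest-ranked
# covariates each round until only the requested number remains.
selector = RFE(estimator=LinearRegression(), n_features_to_select=5)
selector.fit(X, y)
X_reduced = selector.transform(X)                 # kept covariates only
```

`selector.support_` reports which covariates survived; feeding `X_reduced` to the regressors is then the same pipeline as before, just with fewer inputs.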

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network (DANN) implementation was tested on just 10 replications of the IHDP dataset. The code from the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the 10-replication results reported here; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated time to finish was about four and a half days on an Intel dual-core i7.

Domain Adaptation algorithms are a promising avenue for exploring ITE and ATE prediction due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                 εITE           εATE           √εPEHE
DANN (Within-sample)   1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)   1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and out-of-sample, IHDP, 10 replications.

4.3 Discussion

As can be clearly noticed, the results of the machine learning regressors applied in this dissertation are very close to those obtained in the published work of the compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to counter treatment-group imbalance) nor any other custom loss function was needed to obtain the results shown in Table 4.9 and Table 4.10.


The cited authors' complicated methods appear to demand an excessive amount of effort for little additional gain in causal prediction from observational data.

However, they claim that under more unbalanced representations of the feature space or of the treatment assignment, their methods overcome this problem much better than out-of-the-box machine learning algorithms. No metrics on heavily unbalanced treatment-assignment datasets are reported, though.

Finally, the training and testing errors of the Domain Adaptation Neural Networks over 10 replications showed promising results that should be explored in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a unit are close to those obtained by the more elaborate, custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions nor special preprocessing steps (apart from scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, although it shifted once the obtained metrics turned out to be nearly as good as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the deep learning literature, are potentially more suitable than both plain regressors and custom or generalized metric and error functions. Moreover, continuous-space causal problems from observational data with more than two possible outcomes are substantially better suited to Reinforcement Learning algorithms than to any other deep neural network or regressor.

Finally, this work is intended to fill a considerable gap: straightforward definitions for applying machine learning to causality. Although several noteworthy papers were published in the last two years, they are difficult to follow for researchers with a computer or data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply these machine learning methods to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous output should be extended to multi-valued treatments. To the best of my knowledge, this modification should not be costly to make in the code; however, at least one new dataset supporting this kind of treatment would need to be processed.

Third, applying this method to binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, which captures a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it achieved in the experiment run here.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied over time frames the problem as a continuous-space time series. Such datasets will possibly be the next focus of researchers in machine learning applied to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf.

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822.

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5.

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113.

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf.


Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer Patrik O et al Nonlinear causal discovery with additive noise models Tech repURL https is tuebingen mpg de fileadmin user _ upload files publicationsNIPS2008-Hoyer-neu_5406[0]pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww

Bibliography 33

Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos Christos et al (2017) ldquoCausal Effect Inference with Deep Latent-VariableModels arXiv 1705 08821v2 [ stat ML ] 6 Nov 2017rdquo In Nips arXiv arXiv170508821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1

34 Bibliography

Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Complete Draft Tech rep URLhttpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

ndash (2015b) The Self-Normalized Estimator for Counterfactual LearningTekin Cem and Mihaela Van Der Schaar (2018) Episodic Multi-armed Bandits Tech

rep arXiv arXiv150800641v4 URL httpsarxivorgpdf150800641pdfTian Lu et al (2014) ldquoA Simple Method for Estimating Interactions between a Treat-

ment and a Large Number of Covariatesrdquo In Journal of the American Statistical As-sociation 109508 pp 1517ndash1532 ISSN 0162-1459 DOI 101080016214592014951443 URL httpwwwncbinlmnihgovpubmed25729117httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4338439

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep


Chapter 4. Experiments

Consequently, Tables 4.5 and 4.6 show the results for the already split training and test sets obtained from (Johansson, 2017 (accessed July 19, 2018)), which is the exact same dataset used in (Louizos et al., 2017; Shalit, Johansson and Sontag, 2017). As mentioned earlier in this thesis, any hyperparameter tuning was performed on this number of replications.
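For reference, the error measures reported in these tables can be stated concretely. A minimal NumPy sketch follows; the function names and toy arrays are illustrative rather than taken from the thesis code, and since the exact εITE definition varies slightly across the compared papers, only εATE and √εPEHE are shown:

```python
import numpy as np

def ate_error(mu0, mu1, mu0_hat, mu1_hat):
    """Absolute error on the Average Treatment Effect (eps_ATE)."""
    return float(abs(np.mean(mu1 - mu0) - np.mean(mu1_hat - mu0_hat)))

def sqrt_pehe(mu0, mu1, mu0_hat, mu1_hat):
    """Root Precision in Estimation of Heterogeneous Effects (sqrt of
    eps_PEHE): RMSE between true and predicted individual effects."""
    return float(np.sqrt(np.mean(((mu1 - mu0) - (mu1_hat - mu0_hat)) ** 2)))

# On semi-synthetic data such as IHDP both potential outcomes are known,
# so these errors are computable; a perfect predictor scores zero on both.
mu0 = np.array([1.0, 2.0, 3.0])
mu1 = np.array([2.0, 2.5, 4.0])
print(ate_error(mu0, mu1, mu0, mu1))   # 0.0
print(sqrt_pehe(mu0, mu1, mu0, mu1))   # 0.0
```

These quantities are only computable because the simulated counterfactual outcome is available for every unit, which is precisely why semi-synthetic benchmarks are used.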

TABLE 4.5: IHDP, 100 replications, already split dataset - Within-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   3.05 ± 0.38   0.76 ± 0.08   3.17 ± 0.40
BayesianRidge                    4.44 ± 0.58   0.80 ± 0.14   5.61 ± 0.83
LassoLars                        4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
Lasso                            4.76 ± 0.36   4.55 ± 0.17   7.88 ± 0.76
ARDRegression                    4.45 ± 0.59   0.77 ± 0.15   5.61 ± 0.83
PassiveAggressiveRegressor       5.03 ± 0.62   0.83 ± 0.13   5.63 ± 0.82
TheilSenRegressor                4.40 ± 0.57   0.72 ± 0.13   5.60 ± 0.82
BaggingRegressor                 5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
KNeighborsRegressor              5.31 ± 0.48   3.41 ± 0.14   6.72 ± 0.69
LinearRegression                 4.48 ± 0.59   0.75 ± 0.14   5.60 ± 0.82

TABLE 4.6: IHDP, 100 replications, already split dataset - Out-of-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.84 ± 0.28   0.73 ± 0.07   3.49 ± 0.49
BayesianRidge                    4.41 ± 0.57   0.81 ± 0.11   5.74 ± 0.89
LassoLars                        4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
Lasso                            4.65 ± 0.34   4.31 ± 0.14   7.96 ± 0.82
ARDRegression                    4.42 ± 0.58   0.78 ± 0.11   5.73 ± 0.89
PassiveAggressiveRegressor       4.95 ± 0.59   1.01 ± 0.17   5.78 ± 0.89
TheilSenRegressor                4.38 ± 0.56   0.85 ± 0.13   5.74 ± 0.89
BaggingRegressor                 4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
KNeighborsRegressor              4.95 ± 0.46   2.98 ± 0.10   6.65 ± 0.75
LinearRegression                 4.45 ± 0.58   0.78 ± 0.11   5.73 ± 0.89

With 1000 replications, both within-sample and out-of-sample results can be compared with those obtained in (Shalit, Johansson and Sontag, 2017; Louizos et al., 2017). The same semi-synthetic IHDP dataset by (Hill, 2011), with log-linear response setting "A" generated using the code from (Dorie, 2016), was used to perform both types of measurement.

Four different tables are reported for the 1000 replications. Two of them (Tables 4.7, 4.8) were obtained without normalization of the input features (covariates); the other pair was obtained by scaling the covariates to [0, 1] with the MinMaxScaler from the scikit-learn library. The results improved, not dramatically, but enough to keep the scaling for the final results presented in the following section.
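The scaling step can be reproduced in a few lines. A sketch follows (array shapes are illustrative; the key point is fitting the scaler on the training covariates only and reusing its statistics on the test split):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 25))   # stand-in for the IHDP covariates
X_test = rng.normal(size=(20, 25))

scaler = MinMaxScaler()                     # maps each feature to [0, 1]
X_train_s = scaler.fit_transform(X_train)   # fit on the training split only
X_test_s = scaler.transform(X_test)         # reuse the training min/max

print(X_train_s.min(), X_train_s.max())     # 0.0 1.0
```

Test-set values can fall slightly outside [0, 1], since the transform reuses the training minima and maxima; that is the intended behaviour and avoids leaking test-set statistics.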


4.1 Machine learning methods applied to IHDP dataset

TABLE 4.7: IHDP, 1000 replications - No scaling - Within-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   3.09 ± 0.12   0.69 ± 0.02   3.21 ± 0.13
BayesianRidge                    4.59 ± 0.19   0.78 ± 0.04   5.81 ± 0.26
LassoLars                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                            4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                    4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor       5.41 ± 0.22   0.90 ± 0.05   5.85 ± 0.26
TheilSenRegressor                4.55 ± 0.18   0.70 ± 0.03   5.79 ± 0.26
BaggingRegressor                 5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
KNeighborsRegressor              5.31 ± 0.15   3.28 ± 0.04   6.76 ± 0.21
LinearRegression                 4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26

TABLE 4.8: IHDP, 1000 replications - No scaling - Out-of-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.81 ± 0.09   0.78 ± 0.03   3.37 ± 0.14
BayesianRidge                    4.57 ± 0.19   0.98 ± 0.05   5.79 ± 0.26
LassoLars                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                            4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                    4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor       5.42 ± 0.22   1.13 ± 0.07   5.83 ± 0.27
TheilSenRegressor                4.54 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                 4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
KNeighborsRegressor              4.95 ± 0.14   3.09 ± 0.05   6.54 ± 0.22
LinearRegression                 4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

TABLE 4.9: IHDP, 1000 replications - Scaled - Within-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.38 ± 0.08   0.33 ± 0.02   2.77 ± 0.12
BayesianRidge                    4.58 ± 0.19   0.73 ± 0.04   5.80 ± 0.26
LassoLars                        4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
Lasso                            4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
ARDRegression                    4.59 ± 0.19   0.76 ± 0.04   5.80 ± 0.26
PassiveAggressiveRegressor       5.47 ± 0.22   1.02 ± 0.06   5.88 ± 0.26
TheilSenRegressor                4.68 ± 0.19   0.69 ± 0.03   5.79 ± 0.26
BaggingRegressor                 4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
KNeighborsRegressor              4.77 ± 0.13   2.67 ± 0.03   6.37 ± 0.21
RANSACRegressor                  4.93 ± 0.20   1.64 ± 0.09   6.09 ± 0.26
HuberRegressor                   4.44 ± 0.18   0.67 ± 0.03   5.79 ± 0.25
ElasticNet                       4.65 ± 0.10   4.40 ± 0.04   7.91 ± 0.24
LinearRegression                 4.63 ± 0.19   0.73 ± 0.04   5.80 ± 0.26


TABLE 4.10: IHDP, 1000 replications - Scaled - Out-of-sample

Method                           εITE          εATE          √εPEHE
Support Vector Regressor (SVR)   2.44 ± 0.08   0.45 ± 0.03   2.81 ± 0.13
BayesianRidge                    4.55 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
LassoLars                        4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
Lasso                            4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
ARDRegression                    4.58 ± 0.19   0.96 ± 0.05   5.78 ± 0.26
PassiveAggressiveRegressor       5.44 ± 0.22   1.18 ± 0.07   5.87 ± 0.26
TheilSenRegressor                4.68 ± 0.19   0.95 ± 0.05   5.78 ± 0.26
BaggingRegressor                 4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
KNeighborsRegressor              4.46 ± 0.13   2.33 ± 0.04   6.12 ± 0.22
RANSACRegressor                  4.91 ± 0.20   1.73 ± 0.09   6.06 ± 0.27
HuberRegressor                   4.44 ± 0.18   0.92 ± 0.05   5.77 ± 0.26
ElasticNet                       4.66 ± 0.11   4.41 ± 0.05   7.90 ± 0.24
LinearRegression                 4.61 ± 0.19   0.94 ± 0.05   5.78 ± 0.26

Subsequently, Logistic Regression with a multinomial multi-class setting was applied. Its performance is far below that of the regressors, the main reason being that when the continuous target values are encoded into classes, the class probabilities the model estimates are not the quantities that need to be predicted; further precision is lost when decoding the predictions back to numbers. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
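A sketch of this encode-classify-decode setup follows. The equal-width binning scheme is an assumption, since the thesis does not specify its encoding; newton-cg (like lbfgs) fits a multinomial model with the default l2 penalty in recent scikit-learn versions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.1 * rng.normal(size=200)     # continuous outcome

# Encode the continuous target into 10 equal-width classes.
edges = np.linspace(y.min(), y.max(), 11)
labels = np.digitize(y, edges[1:-1])         # class index 0..9 per unit
centers = 0.5 * (edges[:-1] + edges[1:])     # value decoded for each class

clf = LogisticRegression(penalty="l2", solver="newton-cg", max_iter=500)
clf.fit(X, labels)
y_hat = centers[clf.predict(X)]              # decode classes back to numbers

# Decoding loses precision: the error floor is set by the bin width.
print(float(np.mean(np.abs(y_hat - y))))
```

Even with a perfect classifier, the decoded prediction can only be as accurate as half a bin width, which illustrates why the classification route underperforms the regressors here.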

TABLE 4.11: IHDP, 100 replications, logistic regressions - Within-sample

Method                               εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)  7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76
LogisticRegression - L2 (lbfgs)      7.77 ± 0.76   4.40 ± 0.17   7.77 ± 0.76

TABLE 4.12: IHDP, 100 replications, logistic regressions - Out-of-sample

Method                               εITE          εATE          √εPEHE
LogisticRegression - L2 (newton-cg)  5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85
LogisticRegression - L2 (lbfgs)      5.90 ± 0.57   2.41 ± 0.11   7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were performed. The observed errors became even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel with C=1e3 and gamma=0.01. The selection was performed within-sample and out-of-sample on 100 replications of the dataset; this is the same procedure the authors (Shalit, Johansson and Sontag, 2017; Johansson, Shalit and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyper-parameter search, including the finally selected configuration.

TABLE 4.13: IHDP, 100 replications, SVR hyper-parameter tuning - Within-sample

Configuration             εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1          3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e3-g0.05         2.71 ± 0.34   0.45 ± 0.07   2.78 ± 0.36
SVR-rbf-1e3-g0.01         2.35 ± 0.29   0.24 ± 0.03   2.32 ± 0.31
SVR-rbf-1e3-g0.001        3.65 ± 0.45   0.52 ± 0.09   4.51 ± 0.65
SVR-rbf-1e3-g0.0001       4.28 ± 0.55   0.76 ± 0.11   5.61 ± 0.82
SVR-rbf-1e3-g0.00001      4.25 ± 0.52   1.49 ± 0.10   5.97 ± 0.81
SVR-rbf-1e10-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e20-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-rbf-1e30-g0.1         3.17 ± 0.40   0.82 ± 0.09   3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29   0.28 ± 0.03   2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34   0.41 ± 0.06   3.00 ± 0.39

TABLE 4.14: IHDP, 100 replications, SVR hyper-parameter tuning - Out-of-sample

Configuration             εITE          εATE          √εPEHE
SVR-rbf-1e3-g0.1          2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e3-g0.05         2.66 ± 0.24   0.53 ± 0.10   2.71 ± 0.35
SVR-rbf-1e3-g0.01         2.50 ± 0.23   0.31 ± 0.05   2.26 ± 0.31
SVR-rbf-1e3-g0.001        3.45 ± 0.40   0.77 ± 0.16   4.23 ± 0.62
SVR-rbf-1e3-g0.0001       4.09 ± 0.50   0.96 ± 0.21   5.31 ± 0.77
SVR-rbf-1e3-g0.00001      4.05 ± 0.47   1.59 ± 0.18   5.65 ± 0.75
SVR-rbf-1e10-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e20-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-rbf-1e30-g0.1         2.79 ± 0.27   0.86 ± 0.13   3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22   0.38 ± 0.05   2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33   0.48 ± 0.06   2.95 ± 0.39
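The selected configuration can be sketched as follows, here fitting one SVR per treatment arm on simulated data (a T-learner-style setup with a simulated true ATE of 2.0; this is an illustration, and the thesis code may instead feed the treatment indicator as an extra covariate):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
n = 300
X = rng.normal(size=(n, 25))
t = rng.binomial(1, 0.5, size=n)                   # binary treatment
y = X[:, 0] + 2.0 * t + 0.1 * rng.normal(size=n)   # simulated true ATE = 2.0

# Selected configuration: RBF kernel, C=1e3, gamma=0.01; one regressor
# per treatment arm, so both potential outcomes can be predicted per unit.
m0 = SVR(kernel="rbf", C=1e3, gamma=0.01).fit(X[t == 0], y[t == 0])
m1 = SVR(kernel="rbf", C=1e3, gamma=0.01).fit(X[t == 1], y[t == 1])

ite_hat = m1.predict(X) - m0.predict(X)   # predicted individual effects
print(float(ite_hat.mean()))              # ATE estimate
```

Fitting separate models per arm lets each regressor specialize, at the cost of halving the data available to each; with heavy treatment imbalance this trade-off is what the representation-learning methods in the compared papers try to address.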

Finally, the results obtained by the experiments run for this thesis are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results reported in the publication (Shalit, Johansson and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)

Method      √εPEHE     εATE
OLS/LR-1    5.8 ± .3   .73 ± .04
OLS/LR-2    2.4 ± .1   .14 ± .01
BLR         5.8 ± .3   .72 ± .04
k-NN        2.1 ± .1   .14 ± .01
TMLE        5.0 ± .2   .30 ± .01
BART        2.1 ± .1   .23 ± .01
RANDFOR     4.2 ± .2   .73 ± .05
CAUSFOR     3.8 ± .2   .18 ± .01
BNN         2.2 ± .1   .37 ± .03
TARNET      .88 ± .0   .26 ± .01
CFR MMD     .73 ± .0   .30 ± .01
CFR WASS    .71 ± .0   .25 ± .01

Within-sample, IHDP, 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson and Sontag, 2017)

Method      √εPEHE     εATE
OLS/LR-1    5.8 ± .3   .94 ± .06
OLS/LR-2    2.5 ± .1   .31 ± .02
BLR         5.8 ± .3   .93 ± .05
k-NN        4.1 ± .2   .79 ± .05
BART        2.3 ± .1   .34 ± .02
RANDFOR     6.6 ± .3   .96 ± .06
CAUSFOR     3.8 ± .2   .40 ± .03
BNN         2.1 ± .1   .42 ± .03
TARNET      .95 ± .0   .28 ± .01
CFR MMD     .78 ± .0   .31 ± .01
CFR WASS    .76 ± .0   .27 ± .01

Out-of-sample, IHDP, 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section presented publications implementing powerful feature-selection methods, an experiment was performed in the developed code that applies Recursive Feature Elimination (RFE) using the scikit-learn framework.

In addition, under the Strong Ignorability assumption made for the studied dataset, such an experiment would not strictly be appropriate; but given how sensitive the machine learning regressors are to highly correlated input features, it might relieve some of the errors they make.

The results were significantly worse across all algorithms for the causal inference metrics detailed in this work; they are therefore not shown here, but can be inspected in the code implementation for further analysis.
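The experiment can be sketched as follows, on toy data with two informative features (RFE requires an estimator exposing coef_ or feature_importances_, hence the linear model; the thesis code may use different estimators and feature counts):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=100)

# Recursively drop the weakest feature until two remain; RFE ranks
# features by the magnitude of the fitted coefficients at each step.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(np.where(selector.support_)[0])   # indices of the surviving features
```

On this toy problem the two informative features survive; on observational data, however, dropping covariates risks discarding confounders, which is why the causal metrics degraded.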

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network (DANN) implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the 10-replication results reported in this work; the code uploaded with this thesis contains the straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel Dual Core i7.

Domain Adaptation algorithms are a promising direction for ITE and ATE prediction due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                 εITE          εATE          √εPEHE
DANN (within-sample)   1.18 ± 0.17   0.12 ± 0.04   1.02 ± 0.48
DANN (out-of-sample)   1.20 ± 0.11   0.17 ± 0.08   0.76 ± 0.23

Within-sample and out-of-sample, IHDP, 10 replications

4.3 Discussion

As can be clearly noticed, the machine learning regressors applied in this dissertation come very close to the results published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the treatment imbalance of the dataset, nor any other custom loss function was needed to obtain the results shown in Table 4.9 and Table 4.10.


It seems that a considerable amount of the authors' effort goes into complicated methods without gaining much in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Network on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying machine learning regressors out-of-the-box, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a unit are close to those of the more elaborate, custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions or special preprocessing steps (except scaling the features, which is a standard task when using machine learning algorithms) were needed to approach the state-of-the-art metrics achieved in the mentioned and compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, although it shifted once the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, continuous-space causal problems from observational data with more than two possible outcomes are substantially better suited to Reinforcement Learning algorithms than to any other Deep Neural Network or regressor.

Finally, this work is intended to cover a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers were published in the last two years, they are difficult to follow when relating terms from the causal inference field to researchers with a computer and data science background. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions for this work include the following approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to multi-valued treatments. To the best of my knowledge, it should not be costly to make this modification in the code; however, at least one new dataset supporting this kind of treatment arity would need to be processed.

Third, applying this method to binary factual predictions with a Policy Risk threshold deciding whether a treatment should be applied would be an important next step in causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, which corresponds to a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the 1000-replication IHDP dataset is a very promising task, due to both the architectural design of the algorithm and the precision, beyond the state of the art, that it achieved in the experiment run here.

Lastly, the application of these methods to causal datasets whose outcomes vary with time under the applied treatments is framed within time-series problems in the continuous space. Such datasets will possibly be the next focus of machine learning researchers working on treatments applied over time.


Bibliography

Alaa Ahmed M Michael Weisz and Mihaela Van Der Schaar (2017) Deep Counter-factual Networks with Propensity-Dropout Tech rep arXiv arXiv170605966v1URL httpsarxivorgpdf170605966pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf

32 Bibliography

Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060.

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815.

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci.

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf.

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601.

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803.

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www.


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.ncbi.nlm.nih.gov/pubmed/20354511; http://www.nature.com/articles/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373. URL: http://www.ncbi.nlm.nih.gov/pubmed/23025434; http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC3747013.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880. URL: https://doi.org/10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson, and David Sontag. Supplemental Materials for "Estimating individual treatment effect: generalization bounds and algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Complete Draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117; http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC4338439.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.


4.1 Machine learning methods applied to IHDP dataset

TABLE 4.7: IHDP 100 replications - No scaling - Within sample

                                 εITE         εATE         √εPEHE
Support Vector Regressor (SVR)   3.09 ± 0.12  0.69 ± 0.02  3.21 ± 0.13
BayesianRidge                    4.59 ± 0.19  0.78 ± 0.04  5.81 ± 0.26
LassoLars                        4.65 ± 0.10  4.40 ± 0.04  7.91 ± 0.24
Lasso                            4.65 ± 0.10  4.40 ± 0.04  7.91 ± 0.24
ARDRegression                    4.59 ± 0.19  0.76 ± 0.04  5.80 ± 0.26
PassiveAggressiveRegressor       5.41 ± 0.22  0.90 ± 0.05  5.85 ± 0.26
TheilSenRegressor                4.55 ± 0.18  0.70 ± 0.03  5.79 ± 0.26
BaggingRegressor                 5.31 ± 0.15  3.28 ± 0.04  6.76 ± 0.21
KNeighboursRegressor             5.31 ± 0.15  3.28 ± 0.04  6.76 ± 0.21
LinearRegression                 4.63 ± 0.19  0.73 ± 0.04  5.80 ± 0.26
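For reference, the εATE and √εPEHE columns in these tables follow the standard potential-outcome definitions; the sketch below computes them with numpy (array names are illustrative, and the per-unit εITE column follows whichever variant the thesis code implements):

```python
import numpy as np

def causal_errors(y0_true, y1_true, y0_hat, y1_hat):
    """Absolute ATE error and root-PEHE from true and estimated potential outcomes."""
    ite_true = y1_true - y0_true                          # true individual effects
    ite_hat = y1_hat - y0_hat                             # estimated individual effects
    eps_ate = abs(ite_hat.mean() - ite_true.mean())       # |ATE_hat - ATE|
    root_pehe = np.sqrt(np.mean((ite_hat - ite_true) ** 2))
    return eps_ate, root_pehe

# Toy check: true effect is 1 everywhere, estimate is uniformly 1.5.
eps_ate, root_pehe = causal_errors(
    np.zeros(4), np.ones(4), np.zeros(4), np.full(4, 1.5))
```

On IHDP the outcomes are simulated, which is what makes the true effects, and hence these errors, computable at all.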

TABLE 4.8: IHDP 1000 replications - No scaling - Out-of-sample

                                 εITE         εATE         √εPEHE
Support Vector Regressor (SVR)   2.81 ± 0.09  0.78 ± 0.03  3.37 ± 0.14
BayesianRidge                    4.57 ± 0.19  0.98 ± 0.05  5.79 ± 0.26
LassoLars                        4.66 ± 0.11  4.41 ± 0.05  7.90 ± 0.24
Lasso                            4.66 ± 0.11  4.41 ± 0.05  7.90 ± 0.24
ARDRegression                    4.58 ± 0.19  0.96 ± 0.05  5.78 ± 0.26
PassiveAggressiveRegressor       5.42 ± 0.22  1.13 ± 0.07  5.83 ± 0.27
TheilSenRegressor                4.54 ± 0.19  0.95 ± 0.05  5.78 ± 0.26
BaggingRegressor                 4.95 ± 0.14  3.09 ± 0.05  6.54 ± 0.22
KNeighboursRegressor             4.95 ± 0.14  3.09 ± 0.05  6.54 ± 0.22
LinearRegression                 4.61 ± 0.19  0.94 ± 0.05  5.78 ± 0.26

TABLE 4.9: IHDP 100 replications - Scaled - Within sample

                                 εITE         εATE         √εPEHE
Support Vector Regressor (SVR)   2.38 ± 0.08  0.33 ± 0.02  2.77 ± 0.12
BayesianRidge                    4.58 ± 0.19  0.73 ± 0.04  5.80 ± 0.26
LassoLars                        4.65 ± 0.10  4.40 ± 0.04  7.91 ± 0.24
Lasso                            4.65 ± 0.10  4.40 ± 0.04  7.91 ± 0.24
ARDRegression                    4.59 ± 0.19  0.76 ± 0.04  5.80 ± 0.26
PassiveAggressiveRegressor       5.47 ± 0.22  1.02 ± 0.06  5.88 ± 0.26
TheilSenRegressor                4.68 ± 0.19  0.69 ± 0.03  5.79 ± 0.26
BaggingRegressor                 4.77 ± 0.13  2.67 ± 0.03  6.37 ± 0.21
KNeighboursRegressor             4.77 ± 0.13  2.67 ± 0.03  6.37 ± 0.21
RANSACRegressor                  4.93 ± 0.20  1.64 ± 0.09  6.09 ± 0.26
HuberRegressor                   4.44 ± 0.18  0.67 ± 0.03  5.79 ± 0.25
ElasticNet                       4.65 ± 0.10  4.40 ± 0.04  7.91 ± 0.24
LinearRegression                 4.63 ± 0.19  0.73 ± 0.04  5.80 ± 0.26


TABLE 4.10: IHDP 1000 replications - Scaled - Out-of-sample

                                 εITE         εATE         √εPEHE
Support Vector Regressor (SVR)   2.44 ± 0.08  0.45 ± 0.03  2.81 ± 0.13
BayesianRidge                    4.55 ± 0.19  0.95 ± 0.05  5.78 ± 0.26
LassoLars                        4.66 ± 0.11  4.41 ± 0.05  7.90 ± 0.24
Lasso                            4.66 ± 0.11  4.41 ± 0.05  7.90 ± 0.24
ARDRegression                    4.58 ± 0.19  0.96 ± 0.05  5.78 ± 0.26
PassiveAggressiveRegressor       5.44 ± 0.22  1.18 ± 0.07  5.87 ± 0.26
TheilSenRegressor                4.68 ± 0.19  0.95 ± 0.05  5.78 ± 0.26
BaggingRegressor                 4.46 ± 0.13  2.33 ± 0.04  6.12 ± 0.22
KNeighboursRegressor             4.46 ± 0.13  2.33 ± 0.04  6.12 ± 0.22
RANSACRegressor                  4.91 ± 0.20  1.73 ± 0.09  6.06 ± 0.27
HuberRegressor                   4.44 ± 0.18  0.92 ± 0.05  5.77 ± 0.26
ElasticNet                       4.66 ± 0.11  4.41 ± 0.05  7.90 ± 0.24
LinearRegression                 4.61 ± 0.19  0.94 ± 0.05  5.78 ± 0.26

Consequently, Logistic Regression with a multinomial multi-class setting was applied. Its performance is well below that of the regressors, mainly because encoding the continuous target values into classes, so that probabilities can be assigned to them, does not match the values that actually need to be predicted; precision is then lost again when decoding the predictions. The l2 penalty was used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.
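The precision loss in that encode/decode round trip can be seen in a toy example; the exact discretization used in the thesis code is not specified here, so rounding to integer classes is shown as just one possibility:

```python
import numpy as np

y = np.array([2.31, 4.87, 1.02, 3.55])    # continuous outcomes (illustrative values)
classes = np.round(y).astype(int)         # encode: one class per rounded value
decoded = classes.astype(float)           # decode class predictions back to outcomes
roundtrip_error = np.abs(decoded - y)     # error incurred before any model is even fit
```

Even a classifier that predicts every class perfectly cannot beat this floor, which is consistent with the gap between Tables 4.11-4.12 and the regressor tables.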

TABLE 4.11: IHDP 100 replications, logistic regressions - Within sample

                                      εITE         εATE         √εPEHE
LogisticRegression - L2 (NEWTON-CG)   7.77 ± 0.76  4.40 ± 0.17  7.77 ± 0.76
LogisticRegression - L2 (lbfgs)       7.77 ± 0.76  4.40 ± 0.17  7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

                                      εITE         εATE         √εPEHE
LogisticRegression - L2 (NEWTON-CG)   5.90 ± 0.57  2.41 ± 0.11  7.21 ± 0.85
LogisticRegression - L2 (lbfgs)       5.90 ± 0.57  2.41 ± 0.11  7.21 ± 0.85


Across all these tables, the method that consistently obtained the best results was the Support Vector Regressor. Therefore, a few rounds of hyper-parameter tuning were performed. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel with C=1e3 and gamma=0.01. The selection was performed both within-sample and out-of-sample, but on 100 replications of the dataset; this is the same procedure the authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state they use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyper-parameter search.
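The chapter does not restate the fitting procedure at this point, so the following is only a sketch of one straightforward way to use the selected hyper-parameters for potential-outcome regression: one SVR per treatment arm, then differencing the predictions. The data and names below are synthetic stand-ins, not the IHDP files:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in for one replication: covariates x, binary treatment t,
# factual outcome yf (the real IHDP data has 25 covariates and 747 units).
n = 200
x = rng.normal(size=(n, 5))
t = rng.integers(0, 2, size=n)
yf = x[:, 0] + t * (1.0 + 0.5 * x[:, 1]) + 0.1 * rng.normal(size=n)

# Two separate regressors, one per treatment arm, with the hyper-parameters
# selected in Tables 4.13-4.14 (rbf kernel, C=1e3, gamma=0.01).
m0 = SVR(kernel="rbf", C=1e3, gamma=0.01).fit(x[t == 0], yf[t == 0])
m1 = SVR(kernel="rbf", C=1e3, gamma=0.01).fit(x[t == 1], yf[t == 1])

# Predicted potential outcomes for every unit give estimated ITEs and the ATE.
ite_hat = m1.predict(x) - m0.predict(x)
ate_hat = ite_hat.mean()
```

Whether the thesis code fits per-arm models or a single model with the treatment as an input feature is not specified here; the per-arm variant is shown because it needs no extra encoding.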

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within sample

                         εITE         εATE         √εPEHE
SVR-rbf-1e3-g01          3.17 ± 0.40  0.82 ± 0.09  3.30 ± 0.42
SVR-rbf-1e3-g005         2.71 ± 0.34  0.45 ± 0.07  2.78 ± 0.36
SVR-rbf-1e3-g001         2.35 ± 0.29  0.24 ± 0.03  2.32 ± 0.31
SVR-rbf-1e3-g0001        3.65 ± 0.45  0.52 ± 0.09  4.51 ± 0.65
SVR-rbf-1e3-g00001       4.28 ± 0.55  0.76 ± 0.11  5.61 ± 0.82
SVR-rbf-1e3-g000001      4.25 ± 0.52  1.49 ± 0.10  5.97 ± 0.81
SVR-rbf-1e10-g01         3.17 ± 0.40  0.82 ± 0.09  3.30 ± 0.42
SVR-rbf-1e20-g01         3.17 ± 0.40  0.82 ± 0.09  3.30 ± 0.42
SVR-rbf-1e30-g01         3.17 ± 0.40  0.82 ± 0.09  3.30 ± 0.42
SVR-poly-1e3-degree2     2.50 ± 0.29  0.28 ± 0.03  2.53 ± 0.30
SVR-poly-1e3-degree1     2.50 ± 0.29  0.28 ± 0.03  2.53 ± 0.30
SVR-poly-1e3-degree4     2.50 ± 0.29  0.28 ± 0.03  2.53 ± 0.30
SVR-poly-1e10-degree2    2.99 ± 0.34  0.41 ± 0.06  3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

                         εITE         εATE         √εPEHE
SVR-rbf-1e3-g01          2.79 ± 0.27  0.86 ± 0.13  3.25 ± 0.42
SVR-rbf-1e3-g005         2.66 ± 0.24  0.53 ± 0.10  2.71 ± 0.35
SVR-rbf-1e3-g001         2.50 ± 0.23  0.31 ± 0.05  2.26 ± 0.31
SVR-rbf-1e3-g0001        3.45 ± 0.40  0.77 ± 0.16  4.23 ± 0.62
SVR-rbf-1e3-g00001       4.09 ± 0.50  0.96 ± 0.21  5.31 ± 0.77
SVR-rbf-1e3-g000001      4.05 ± 0.47  1.59 ± 0.18  5.65 ± 0.75
SVR-rbf-1e10-g01         2.79 ± 0.27  0.86 ± 0.13  3.25 ± 0.42
SVR-rbf-1e20-g01         2.79 ± 0.27  0.86 ± 0.13  3.25 ± 0.42
SVR-rbf-1e30-g01         2.79 ± 0.27  0.86 ± 0.13  3.25 ± 0.42
SVR-poly-1e3-degree2     2.87 ± 0.22  0.38 ± 0.05  2.48 ± 0.29
SVR-poly-1e3-degree1     2.87 ± 0.22  0.38 ± 0.05  2.48 ± 0.29
SVR-poly-1e3-degree4     2.87 ± 0.22  0.38 ± 0.05  2.48 ± 0.29
SVR-poly-1e10-degree2    3.21 ± 0.33  0.48 ± 0.06  2.95 ± 0.39

Finally, the results obtained by the experiments run for this thesis are displayed in Table 4.9 and Table 4.10, whereas Table 4.15 and Table 4.16 show the results reported in the publication (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE     εATE
OLS/LR-1    5.8 ± .3   .73 ± .04
OLS/LR-2    2.4 ± .1   .14 ± .01
BLR         5.8 ± .3   .72 ± .04
k-NN        2.1 ± .1   .14 ± .01
TMLE        5.0 ± .2   .30 ± .01
BART        2.1 ± .1   .23 ± .01
RAND.FOR.   4.2 ± .2   .73 ± .05
CAUS.FOR.   3.8 ± .2   .18 ± .01
BNN         2.2 ± .1   .37 ± .03
TARNET      .88 ± .0   .26 ± .01
CFR MMD     .73 ± .0   .30 ± .01
CFR WASS    .71 ± .0   .25 ± .01

Within sample, IHDP, 1000 replications.

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

            √εPEHE     εATE
OLS/LR-1    5.8 ± .3   .94 ± .06
OLS/LR-2    2.5 ± .1   .31 ± .02
BLR         5.8 ± .3   .93 ± .05
k-NN        4.1 ± .2   .79 ± .05
BART        2.3 ± .1   .34 ± .02
RAND.FOR.   6.6 ± .3   .96 ± .06
CAUS.FOR.   3.8 ± .2   .40 ± .03
BNN         2.1 ± .1   .42 ± .03
TARNET      .95 ± .0   .28 ± .01
CFR MMD     .78 ± .0   .31 ± .01
CFR WASS    .76 ± .0   .27 ± .01

Out-of-sample, IHDP, 1000 replications.


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section discussed publications implementing more powerful feature selection methods, an experiment was performed in the developed code applying Recursive Feature Elimination (RFE) using the scikit-learn library.

In addition, under the Strong Ignorability assumption made for the dataset studied, it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, removing features might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work; they are therefore not shown here, but can be reviewed in the code implementation for further analysis.
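For readers inspecting the code, a scikit-learn RFE set-up of the kind described can be as small as the sketch below. The data, feature counts, and linear-kernel choice are illustrative, not those of the IHDP experiment; RFE requires an estimator exposing coef_ or feature_importances_, which the rbf-kernel SVR used elsewhere does not provide:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 8))
# Only features 0 and 3 carry signal; the rest are noise.
y = 2.0 * x[:, 0] - x[:, 3] + 0.05 * rng.normal(size=100)

# RFE repeatedly fits the estimator and drops the weakest feature (by |coef_|)
# until n_features_to_select remain.
selector = RFE(SVR(kernel="linear"), n_features_to_select=2).fit(x, y)
kept = np.flatnonzero(selector.support_)   # indices of the surviving features
```

On this toy problem the two informative features survive the elimination.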

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network (DANN) implementation was tested on just 10 replications of the IHDP dataset. The code from the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the 10-replication results reported here; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those of the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel dual-core i7.

Due to their architectural design, Domain Adaptation algorithms are a promising direction for exploring ITE and ATE prediction.
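For context, domain-adversarial training of the kind these networks use optimizes a two-headed objective over a shared representation, with the treated and control groups playing the role of the two domains. In the usual formulation (the notation below is assumed for illustration, not taken from the thesis code):

```latex
\min_{\theta_f,\,\theta_y}\;\max_{\theta_d}\;
\sum_{i=1}^{n} \mathcal{L}_y\!\big(h_y(\Phi(x_i;\theta_f);\theta_y),\, y_i\big)
\;-\; \lambda \sum_{i=1}^{n} \mathcal{L}_d\!\big(h_d(\Phi(x_i;\theta_f);\theta_d),\, t_i\big)
```

where Φ is the shared feature extractor, h_y the outcome head, h_d the treatment-group discriminator, and λ the trade-off weight; the inner maximization over θ_d is what a gradient reversal layer implements during backpropagation, pushing the representation to be predictive of outcomes but uninformative about treatment assignment.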

TABLE 4.17: Domain Adaptation Neural Networks

                       εITE         εATE         √εPEHE
DANN (Within-sample)   1.18 ± 0.17  0.12 ± 0.04  1.02 ± 0.48
DANN (Out-of-sample)   1.20 ± 0.11  0.17 ± 0.08  0.76 ± 0.23

Within-sample and out-of-sample, IHDP, 10 replications.

4.3 Discussion

As can be clearly noticed, the results of the machine learning regressors applied in this dissertation are very close to those published in the works of the cited authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome treatment imbalance) or any other custom loss function was applied to obtain the results shown in Table 4.9 and Table 4.10.

28 Chapter 4 Experiments

The considerable effort invested by those authors appears to lead to complicated methods that do not gain much in causal prediction from observational data.

However, they claim that with more unbalanced feature-space representations or treatment assignments, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No further metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Networks over 10 replications showed promising results that should be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a unit are close to those obtained by the more elaborate, custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions or special preprocessing steps (apart from scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, and its scope grew once the obtained metrics came almost as close as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the deep learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, continuous-space causal problems from observational data, involving more than two possible outcomes, are substantially better suited to Reinforcement Learning algorithms than to any other deep neural network or regressor.

Finally, this work is intended to fill a considerable gap in straightforward definitions for applying machine learning to causality. Although several notable papers have been published in the last two years, they are difficult to follow for researchers from computer and data science backgrounds trying to relate terms from the causal inference field to their own. I gave my best to compile, define, explain, detail, and relate causal inference to machine learning terminology.


5.2 Future work

Future directions for this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to multi-valued treatments. To the best of my knowledge, this modification should not be costly to make in the code; however, at least one new dataset supporting this kind of treatment would need to be processed.

Third, applying this method to binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on datasets of this type, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the 1000-replication IHDP dataset is a very promising task, due to both the architectural design of the algorithm and the precision, outperforming the state of the art, that it showed in the experiment run here.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied frames them as continuous-space time-series problems. Such datasets will possibly be the next focus of researchers applying machine learning to treatments delivered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf.

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822.

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5.

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.


Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Complete Draft Tech rep URLhttpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

ndash (2015b) The Self-Normalized Estimator for Counterfactual LearningTekin Cem and Mihaela Van Der Schaar (2018) Episodic Multi-armed Bandits Tech

rep arXiv arXiv150800641v4 URL httpsarxivorgpdf150800641pdfTian Lu et al (2014) ldquoA Simple Method for Estimating Interactions between a Treat-

ment and a Large Number of Covariatesrdquo In Journal of the American Statistical As-sociation 109508 pp 1517ndash1532 ISSN 0162-1459 DOI 101080016214592014951443 URL httpwwwncbinlmnihgovpubmed25729117httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4338439

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep

  • Declaration of Authorship
  • Abstract
  • Acknowledgements
  • Introduction
    • Motivation
    • Purpose and Research Question
    • Approach and Methodology
    • Scope and Limitation
      • Background
        • Rubin-Newman Causal Model
          • The fundamental problem of causal analysis
          • Metrics for Causality
          • Assumptions
          • Definitions
          • Related Work
            • Machine Learning
              • Ordinary Least Squares (Linear Regression)
              • Ridge Regression
              • Support Vector Regressor
              • Bayesian Ridge
              • Lasso
              • Lasso Lars
              • ARD Regression
              • Passive Aggressive Regressor
              • Theil Sen Regressor
              • K-Neighbors Regressor
              • Logistic Regression
                  • Methodology
                    • Dataset
                    • IHDP dataset
                    • Other articles metrics
                      • Experiments
                        • Machine learning methods applied to IHDP dataset
                        • Other experiments
                          • Recursive Feature Elimination
                          • Domain Adaptation Neural Networks
                            • Discussion
                              • Conclusions
                                • Concluding Remarks
                                • Future work
                                  • Bibliography
Page 31: Machine Learning for causal Inference on Observational Datarepository.essex.ac.uk/24772/1/HernanBorre_Master_Thesis_2018v3.… · School of Computer Science and Electronic Engineering

Chapter 4 Experiments

TABLE 4.10: IHDP 1000 replications - No Scaling - Out-of-sample

Method                          εITE         εATE         √εPEHE
Support Vector Regressor (SVR)  2.44 ± 0.08  0.45 ± 0.03  2.81 ± 0.13
BayesianRidge                   4.55 ± 0.19  0.95 ± 0.05  5.78 ± 0.26
LassoLars                       4.66 ± 0.11  4.41 ± 0.05  7.90 ± 0.24
Lasso                           4.66 ± 0.11  4.41 ± 0.05  7.90 ± 0.24
ARDRegression                   4.58 ± 0.19  0.96 ± 0.05  5.78 ± 0.26
PassiveAggressiveRegressor      5.44 ± 0.22  1.18 ± 0.07  5.87 ± 0.26
TheilSenRegressor               4.68 ± 0.19  0.95 ± 0.05  5.78 ± 0.26
BaggingRegressor                4.46 ± 0.13  2.33 ± 0.04  6.12 ± 0.22
KNeighboursRegressor            4.46 ± 0.13  2.33 ± 0.04  6.12 ± 0.22
RANSACRegressor                 4.91 ± 0.20  1.73 ± 0.09  6.06 ± 0.27
HuberRegressor                  4.44 ± 0.18  0.92 ± 0.05  5.77 ± 0.26
ElasticNet                      4.66 ± 0.11  4.41 ± 0.05  7.90 ± 0.24
LinearRegression                4.61 ± 0.19  0.94 ± 0.05  5.78 ± 0.26

Consequently, Logistic Regression with a multinomial multi-class predictor has been applied. Its performance is well below that of the regressors, the main reason being that when the continuous target values are encoded into classes to assign them a probability, the encoded values are no longer the ones that need to be predicted; further precision is lost when decoding the predictions. The l2 penalty has been used with two different solvers, newton-cg and lbfgs. These results are displayed in Table 4.11 and Table 4.12.

TABLE 4.11: IHDP 100 replications, logistic regressions - Within-sample

Method                               εITE         εATE         √εPEHE
LogisticRegression - L2 (newton-cg)  7.77 ± 0.76  4.40 ± 0.17  7.77 ± 0.76
LogisticRegression - L2 (lbfgs)      7.77 ± 0.76  4.40 ± 0.17  7.77 ± 0.76

TABLE 4.12: IHDP 100 replications, logistic regressions - Out-of-sample

Method                               εITE         εATE         √εPEHE
LogisticRegression - L2 (newton-cg)  5.90 ± 0.57  2.41 ± 0.11  7.21 ± 0.85
LogisticRegression - L2 (lbfgs)      5.90 ± 0.57  2.41 ± 0.11  7.21 ± 0.85
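The encode/decode mismatch described above can be sketched as follows (a hypothetical numpy-only illustration; even a perfect classifier would retain the quantization error introduced by discretizing a continuous outcome):

```python
import numpy as np

# Hypothetical sketch: casting regression as multinomial classification
# forces the continuous outcome into discrete classes; decoding a class
# back to a value loses precision, matching the behaviour observed above.
rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=1000)   # continuous outcomes

bins = np.linspace(y.min(), y.max(), num=21)    # 20 classes
classes = np.digitize(y, bins[1:-1])            # encode: class index per unit
centers = (bins[:-1] + bins[1:]) / 2            # decode: class -> bin center
y_decoded = centers[classes]

# Even a *perfect* classifier cannot beat this quantization error floor:
quant_rmse = np.sqrt(np.mean((y - y_decoded) ** 2))
print(quant_rmse)
```

This irreducible floor is one reason the logistic models trail the regressors regardless of solver.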


From all these tables, the method which consistently obtained the best results was the Support Vector Regressor. Therefore, a few runs of hyper-parameter tuning were done. The errors observed were even smaller, so the final hyper-parameters selected for this dataset were a Radial Basis Function (rbf) kernel, C=1e3 and gamma=0.01. The selection was performed both within-sample and out-of-sample, but on 100 replications of the dataset; this is the same method the compared authors (Shalit, Johansson, and Sontag, 2017; Johansson, Shalit, and Sontag, 2016a; Louizos et al., 2017) state to use


for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of the SVR hyper-parameter search, including the final selected configuration.

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

Method                 εITE         εATE         √εPEHE
SVR-rbf-1e3-g0.1       3.17 ± 0.40  0.82 ± 0.09  3.30 ± 0.42
SVR-rbf-1e3-g0.05      2.71 ± 0.34  0.45 ± 0.07  2.78 ± 0.36
SVR-rbf-1e3-g0.01      2.35 ± 0.29  0.24 ± 0.03  2.32 ± 0.31
SVR-rbf-1e3-g0.001     3.65 ± 0.45  0.52 ± 0.09  4.51 ± 0.65
SVR-rbf-1e3-g0.0001    4.28 ± 0.55  0.76 ± 0.11  5.61 ± 0.82
SVR-rbf-1e3-g0.00001   4.25 ± 0.52  1.49 ± 0.10  5.97 ± 0.81
SVR-rbf-1e10-g0.1      3.17 ± 0.40  0.82 ± 0.09  3.30 ± 0.42
SVR-rbf-1e20-g0.1      3.17 ± 0.40  0.82 ± 0.09  3.30 ± 0.42
SVR-rbf-1e30-g0.1      3.17 ± 0.40  0.82 ± 0.09  3.30 ± 0.42
SVR-poly-1e3-degree2   2.50 ± 0.29  0.28 ± 0.03  2.53 ± 0.30
SVR-poly-1e3-degree1   2.50 ± 0.29  0.28 ± 0.03  2.53 ± 0.30
SVR-poly-1e3-degree4   2.50 ± 0.29  0.28 ± 0.03  2.53 ± 0.30
SVR-poly-1e10-degree2  2.99 ± 0.34  0.41 ± 0.06  3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

Method                 εITE         εATE         √εPEHE
SVR-rbf-1e3-g0.1       2.79 ± 0.27  0.86 ± 0.13  3.25 ± 0.42
SVR-rbf-1e3-g0.05      2.66 ± 0.24  0.53 ± 0.10  2.71 ± 0.35
SVR-rbf-1e3-g0.01      2.50 ± 0.23  0.31 ± 0.05  2.26 ± 0.31
SVR-rbf-1e3-g0.001     3.45 ± 0.40  0.77 ± 0.16  4.23 ± 0.62
SVR-rbf-1e3-g0.0001    4.09 ± 0.50  0.96 ± 0.21  5.31 ± 0.77
SVR-rbf-1e3-g0.00001   4.05 ± 0.47  1.59 ± 0.18  5.65 ± 0.75
SVR-rbf-1e10-g0.1      2.79 ± 0.27  0.86 ± 0.13  3.25 ± 0.42
SVR-rbf-1e20-g0.1      2.79 ± 0.27  0.86 ± 0.13  3.25 ± 0.42
SVR-rbf-1e30-g0.1      2.79 ± 0.27  0.86 ± 0.13  3.25 ± 0.42
SVR-poly-1e3-degree2   2.87 ± 0.22  0.38 ± 0.05  2.48 ± 0.29
SVR-poly-1e3-degree1   2.87 ± 0.22  0.38 ± 0.05  2.48 ± 0.29
SVR-poly-1e3-degree4   2.87 ± 0.22  0.38 ± 0.05  2.48 ± 0.29
SVR-poly-1e10-degree2  3.21 ± 0.33  0.48 ± 0.06  2.95 ± 0.39
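The gamma values explored in these tables control the width of the RBF kernel, K(x, x') = exp(-γ‖x - x'‖²). A small numpy sketch (synthetic data and illustrative names, not the thesis code) shows why extreme values hurt:

```python
import numpy as np

def rbf_kernel(X, gamma):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2), the kernel used by SVR-rbf."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))    # clamp float round-off

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 25))          # 25 covariates, as in IHDP

K_wide = rbf_kernel(X, gamma=1e-5)     # tiny gamma: K ~ all ones, underfits
K_mid = rbf_kernel(X, gamma=0.01)      # the value selected above
K_narrow = rbf_kernel(X, gamma=10.0)   # huge gamma: K ~ identity, memorizes

print(K_wide.mean(), K_mid.mean(), K_narrow.mean())
```

With a too-small gamma every unit looks identical to the kernel, and with a too-large gamma no unit resembles any other, which is consistent with the error pattern across the g0.1 to g0.00001 rows.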

Finally, the results obtained by the experiments run in this thesis are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results reported in (Shalit, Johansson, and Sontag, 2017).


TABLE 4.15: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method    √εPEHE    εATE
OLS/LR-1  5.8 ± .3  .73 ± .04
OLS/LR-2  2.4 ± .1  .14 ± .01
BLR       5.8 ± .3  .72 ± .04
k-NN      2.1 ± .1  .14 ± .01
TMLE      5.0 ± .2  .30 ± .01
BART      2.1 ± .1  .23 ± .01
RANDFOR   4.2 ± .2  .73 ± .05
CAUSFOR   3.8 ± .2  .18 ± .01
BNN       2.2 ± .1  .37 ± .03
TARNET    .88 ± .0  .26 ± .01
CFR MMD   .73 ± .0  .30 ± .01
CFR WASS  .71 ± .0  .25 ± .01

Within-sample, IHDP, 1000 replications

TABLE 4.16: ICML 2017 - Estimating individual treatment effect: generalization bounds and algorithms (Shalit, Johansson, and Sontag, 2017)

Method    √εPEHE    εATE
OLS/LR-1  5.8 ± .3  .94 ± .06
OLS/LR-2  2.5 ± .1  .31 ± .02
BLR       5.8 ± .3  .93 ± .05
k-NN      4.1 ± .2  .79 ± .05
BART      2.3 ± .1  .34 ± .02
RANDFOR   6.6 ± .3  .96 ± .06
CAUSFOR   3.8 ± .2  .40 ± .03
BNN       2.1 ± .1  .42 ± .03
TARNET    .95 ± .0  .28 ± .01
CFR MMD   .78 ± .0  .31 ± .01
CFR WASS  .76 ± .0  .27 ± .01

Out-of-sample, IHDP, 1000 replications
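For reference, the error measures reported throughout these tables can be computed directly when ground-truth potential outcomes are available, as they are for the semi-synthetic IHDP data. A minimal sketch (function and variable names are illustrative, and the thesis's exact εITE definition is not reproduced here):

```python
import numpy as np

def causal_metrics(mu0, mu1, mu0_hat, mu1_hat):
    """eps_ATE and sqrt(eps_PEHE) given true and predicted potential outcomes.

    IHDP is semi-synthetic, so the true noiseless outcomes mu0/mu1 (and hence
    the true effect tau) are known for every unit, factual or counterfactual.
    """
    tau = mu1 - mu0                      # true individual treatment effect
    tau_hat = mu1_hat - mu0_hat          # predicted ITE
    eps_ate = np.abs(tau_hat.mean() - tau.mean())
    root_pehe = np.sqrt(np.mean((tau_hat - tau) ** 2))
    return eps_ate, root_pehe

# toy check: an ITE prediction off by a constant 0.5 on every unit gives
# eps_ATE = 0.5 and sqrt(eps_PEHE) = 0.5
mu0 = np.zeros(100)
mu1 = np.ones(100) * 4.0
eps_ate, root_pehe = causal_metrics(mu0, mu1, mu0, mu1 + 0.5)
print(eps_ate, root_pehe)   # → 0.5 0.5
```

The tables above report these quantities averaged over dataset replications, with the standard error after the ± sign.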


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though publications on powerful feature selection methods were discussed in the Related Work section, an experiment was performed in the developed code that applies Recursive Feature Elimination (RFE) using the scikit-learn library.

In addition, assuming Strong Ignorability on the studied dataset, it would not strictly be appropriate to perform such an experiment; however, given the sensitivity of the machine learning regressors to highly correlated input features, eliminating some of them might relieve part of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work, so they are not shown; the reader can reproduce them from the code implementation for further analysis.
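As a reference for what this experiment does, the elimination loop behind RFE can be sketched with a plain least-squares estimator (a hypothetical numpy-only illustration; the actual experiment used scikit-learn's RFE):

```python
import numpy as np

def rfe_linear(X, y, n_keep):
    """Minimal Recursive Feature Elimination sketch: repeatedly fit least
    squares and drop the feature with the smallest absolute coefficient,
    mirroring what sklearn.feature_selection.RFE does with a linear
    estimator. Assumes the columns of X are on comparable scales."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        coef, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
        weakest = np.argmin(np.abs(coef))
        keep.pop(weakest)                 # eliminate least informative feature
    return keep

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)
print(sorted(rfe_linear(X, y, n_keep=2)))   # → [0, 3]
```

On IHDP the eliminated covariates evidently still carried outcome-relevant signal, which is consistent with the degraded metrics observed.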

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the results for 10 replications; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results, shown below in Table 4.17, are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel Dual Core i7.

Domain Adaptation algorithms are a promising direction for exploring ITE and ATE prediction due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

Method                εITE         εATE         √εPEHE
DANN (Within-sample)  1.18 ± 0.17  0.12 ± 0.04  1.02 ± 0.48
DANN (Out-of-sample)  1.20 ± 0.11  0.17 ± 0.08  0.76 ± 0.23

Within-sample and Out-of-sample, IHDP, 10 replications
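The mechanism that makes such networks "domain adversarial" is the gradient reversal layer: the forward pass is the identity, but the gradient flowing back from the domain (treated vs. control) classifier is flipped and scaled, so the shared representation learns to confuse that classifier while still predicting the outcome. A toy sketch of just that layer (illustrative names, not the executed implementation):

```python
import numpy as np

def grl_forward(x):
    """Gradient reversal layer, forward pass: plain identity."""
    return x

def grl_backward(grad_from_domain_head, lam=1.0):
    """Backward pass: the domain-classifier gradient reaches the encoder
    negated and scaled by lam, pushing the representation to be
    indistinguishable across the two treatment arms."""
    return -lam * grad_from_domain_head

g = np.array([0.2, -0.5, 1.0])       # gradient arriving from the domain head
print(grl_backward(g, lam=0.1))      # reversed, scaled gradient to the encoder
```

Balancing the treated and control representations in this way plays a role similar to the distributional penalties (e.g. IPM terms) used by the compared counterfactual regression methods.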

4.3 Discussion

As can clearly be noticed, the results of the machine learning regressors applied in this dissertation are very close to those obtained in the work published by the cited authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the treatment imbalance in the dataset, nor any other custom loss function was needed to obtain the results shown in Table 4.10 and Table 4.9.


It seems that a considerable amount of the cited authors' effort goes into complicated methods that gain little additional accuracy in causal prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or treatment assignment, their methods overcome this problem much better than out-of-the-box machine learning algorithms. No metrics are reported on heavily unbalanced treatment-assignment datasets, though.

Finally, the training and testing errors of the Domain Adaptation Neural Network on 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less (or no) added complexity, to predict the factual and counterfactual outcomes of a unit are close to those of the more elaborate, custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions or special preprocessing steps (except scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, and it broadened once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.
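The overall recipe, fitting standard regressors on the factual data and then querying both potential outcomes, can be sketched as follows. This is a simplified two-model variant with ordinary least squares standing in for the scikit-learn regressors used in the experiments; all names and the synthetic data are illustrative:

```python
import numpy as np

def fit_predict_ite(X, t, y, X_new):
    """Fit one regressor per treatment arm on the factual data, then predict
    both potential outcomes (factual and counterfactual) for every unit.
    Any of the sklearn regressors from the tables above could replace lstsq."""
    Xb = np.c_[np.ones(len(X)), X]                # add intercept column
    Xn = np.c_[np.ones(len(X_new)), X_new]
    w0, *_ = np.linalg.lstsq(Xb[t == 0], y[t == 0], rcond=None)
    w1, *_ = np.linalg.lstsq(Xb[t == 1], y[t == 1], rcond=None)
    y0_hat, y1_hat = Xn @ w0, Xn @ w1             # predicted potential outcomes
    ite_hat = y1_hat - y0_hat
    return ite_hat, ite_hat.mean()                # per-unit ITE and the ATE

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
t = (rng.random(500) < 0.5).astype(int)
y = X[:, 0] + 2.0 * t + 0.1 * rng.normal(size=500)   # true effect = 2
ite_hat, ate_hat = fit_predict_ite(X, t, y, X)
print(round(float(ate_hat), 1))   # → 2.0
```

Under Strong Ignorability this recovers the treatment effect even though each unit's counterfactual outcome is never observed.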

It is important to notice that machine learning techniques introduced in the last few years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, continuous-space causal problems from observational data with more than two possible outcomes are substantially better suited to Reinforcement Learning algorithms than to any Deep Neural Network or regressor.

Finally, this work is intended to cover a considerable gap in straightforward definitions for applying machine learning to causality. Although several noticeable papers have been published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions of this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, extending the functionality developed for a binary treatment and a continuous outcome to a multi-valued treatment. To the best of my knowledge, it should not be costly to make this modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.

Third, applying this method to binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied, would be an important next step in causal inference from observational data. The machine learning algorithms applied in this work are suitable for this type of dataset, which represents a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the 1000 replications of the IHDP dataset is a very promising task, due to both the architectural design of the algorithm and the precision, outperforming the state of the art, that it achieved in the experiment run here.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied, framed as time-series problems in continuous space. Such datasets will possibly be the next focus of researchers applying machine learning to treatments applied over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela Van Der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: arXiv:1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: arXiv:1612.08082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf


Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie Vincent (2016) NPCI Non-parametrics for Causal Inference URL https githubcomvdorienpci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer Patrik O et al Nonlinear causal discovery with additive noise models Tech repURL https is tuebingen mpg de fileadmin user _ upload files publicationsNIPS2008-Hoyer-neu_5406[0]pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww


ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: arXiv:1705.08821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1


Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Complete Draft Tech rep URLhttpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: arXiv:1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep

• Declaration of Authorship
• Abstract
• Acknowledgements
• Introduction
  • Motivation
  • Purpose and Research Question
  • Approach and Methodology
  • Scope and Limitation
• Background
  • Rubin-Neyman Causal Model
    • The fundamental problem of causal analysis
  • Metrics for Causality
  • Assumptions
  • Definitions
• Related Work
  • Machine Learning
    • Ordinary Least Squares (Linear Regression)
    • Ridge Regression
    • Support Vector Regressor
    • Bayesian Ridge
    • Lasso
    • Lasso Lars
    • ARD Regression
    • Passive Aggressive Regressor
    • Theil Sen Regressor
    • K-Neighbors Regressor
    • Logistic Regression
• Methodology
  • Dataset
  • IHDP dataset
  • Other articles metrics
• Experiments
  • Machine learning methods applied to IHDP dataset
  • Other experiments
    • Recursive Feature Elimination
    • Domain Adaptation Neural Networks
  • Discussion
• Conclusions
  • Concluding Remarks
  • Future work
• Bibliography
Page 32: Machine Learning for causal Inference on Observational Datarepository.essex.ac.uk/24772/1/HernanBorre_Master_Thesis_2018v3.… · School of Computer Science and Electronic Engineering

4.1 Machine learning methods applied to IHDP dataset

for their own hyper-parameter selection. Table 4.13 and Table 4.14 show the results of running the SVR hyper-parameter selection, with the final configurations compared.

TABLE 4.13: IHDP 100 replications, SVR hyper-parameter tuning - Within-sample

                          εITE           εATE           √εPEHE

SVR-rbf-1e3-g01           3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e3-g005          2.71 ± 0.34    0.45 ± 0.07    2.78 ± 0.36
SVR-rbf-1e3-g001          2.35 ± 0.29    0.24 ± 0.03    2.32 ± 0.31
SVR-rbf-1e3-g0001         3.65 ± 0.45    0.52 ± 0.09    4.51 ± 0.65
SVR-rbf-1e3-g00001        4.28 ± 0.55    0.76 ± 0.11    5.61 ± 0.82
SVR-rbf-1e3-g000001       4.25 ± 0.52    1.49 ± 0.10    5.97 ± 0.81
SVR-rbf-1e10-g01          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e20-g01          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-rbf-1e30-g01          3.17 ± 0.40    0.82 ± 0.09    3.30 ± 0.42
SVR-poly-1e3-degree2      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree1      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e3-degree4      2.50 ± 0.29    0.28 ± 0.03    2.53 ± 0.30
SVR-poly-1e10-degree2     2.99 ± 0.34    0.41 ± 0.06    3.00 ± 0.39

TABLE 4.14: IHDP 100 replications, SVR hyper-parameter tuning - Out-of-sample

                          εITE           εATE           √εPEHE

SVR-rbf-1e3-g01           2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e3-g005          2.66 ± 0.24    0.53 ± 0.10    2.71 ± 0.35
SVR-rbf-1e3-g001          2.50 ± 0.23    0.31 ± 0.05    2.26 ± 0.31
SVR-rbf-1e3-g0001         3.45 ± 0.40    0.77 ± 0.16    4.23 ± 0.62
SVR-rbf-1e3-g00001        4.09 ± 0.50    0.96 ± 0.21    5.31 ± 0.77
SVR-rbf-1e3-g000001       4.05 ± 0.47    1.59 ± 0.18    5.65 ± 0.75
SVR-rbf-1e10-g01          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e20-g01          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-rbf-1e30-g01          2.79 ± 0.27    0.86 ± 0.13    3.25 ± 0.42
SVR-poly-1e3-degree2      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree1      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e3-degree4      2.87 ± 0.22    0.38 ± 0.05    2.48 ± 0.29
SVR-poly-1e10-degree2     3.21 ± 0.33    0.48 ± 0.06    2.95 ± 0.39
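A sweep like the one summarized in Tables 4.13 and 4.14 can be reproduced in a few lines of scikit-learn. The sketch below is illustrative only: it uses synthetic data rather than IHDP, and the grid values follow my reading of the row labels (e.g. "1e3-g01" as C = 1e3, γ = 0.1).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))             # 25 covariates, as in IHDP
y = X[:, 0] + 0.5 * rng.normal(size=200)   # synthetic factual outcome

X = StandardScaler().fit_transform(X)

# Grid mirroring the configurations reported in Tables 4.13 / 4.14;
# the mapping from labels to (kernel, C, gamma, degree) is my assumption.
configs = [{"kernel": "rbf", "C": 1e3, "gamma": g}
           for g in (0.1, 0.05, 0.01, 0.001, 0.0001, 0.00001)]
configs += [{"kernel": "poly", "C": 1e3, "degree": d} for d in (1, 2, 4)]

for cfg in configs:
    model = SVR(**cfg).fit(X, y)
    mse = np.mean((model.predict(X) - y) ** 2)  # in-sample factual error
    print(cfg, round(float(mse), 3))
```

In the thesis proper, each configuration would be fitted per replication on the treated and control arms and scored with the causal metrics rather than factual MSE.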

Finally, the results obtained by the experiments run for this thesis are displayed in Table 4.10 and Table 4.9, whereas Table 4.15 and Table 4.16 show the results reported in the publication by Shalit, Johansson, and Sontag (2017).
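For reference, the εATE and √εPEHE columns reported throughout this chapter can be computed from true and predicted potential outcomes as follows. This is a minimal sketch: the function name and toy numbers are mine, and it assumes access to the noiseless potential outcomes, as the semi-synthetic IHDP benchmark provides.

```python
import numpy as np

def eval_causal_metrics(mu0, mu1, mu0_hat, mu1_hat):
    """IHDP-style evaluation: epsilon_ATE and sqrt(epsilon_PEHE).

    mu0, mu1         -- true (noiseless) potential outcomes per unit
    mu0_hat, mu1_hat -- model predictions for the same units
    """
    ite_true = mu1 - mu0            # per-unit treatment effect
    ite_hat = mu1_hat - mu0_hat
    eps_ate = abs(ite_hat.mean() - ite_true.mean())
    sqrt_pehe = np.sqrt(((ite_hat - ite_true) ** 2).mean())
    return eps_ate, sqrt_pehe

# Toy check: a constant +1 bias on the treated arm shifts both metrics by 1
mu0 = np.zeros(100)
mu1 = 4.0 * np.ones(100)
eps_ate, sqrt_pehe = eval_causal_metrics(mu0, mu1, mu0, mu1 + 1.0)
print(eps_ate, sqrt_pehe)  # 1.0 1.0
```

Conventions for εITE vary between papers, so only the two metrics shared by all the compared tables are shown.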


TABLE 4.15: ICML 2017 - "Estimating individual treatment effect: generalization bounds and algorithms" (Shalit, Johansson, and Sontag, 2017)

              √εPEHE      εATE

OLS/LR-1      5.8 ± .3    .73 ± .04
OLS/LR-2      2.4 ± .1    .14 ± .01
BLR           5.8 ± .3    .72 ± .04
k-NN          2.1 ± .1    .14 ± .01
TMLE          5.0 ± .2    .30 ± .01
BART          2.1 ± .1    .23 ± .01
RANDFOR       4.2 ± .2    .73 ± .05
CAUSFOR       3.8 ± .2    .18 ± .01
BNN           2.2 ± .1    .37 ± .03
TARNET        .88 ± .0    .26 ± .01
CFR MMD       .73 ± .0    .30 ± .01
CFR WASS      .71 ± .0    .25 ± .01

Within sample IHDP 1000 replications

TABLE 4.16: ICML 2017 - "Estimating individual treatment effect: generalization bounds and algorithms" (Shalit, Johansson, and Sontag, 2017)

              √εPEHE      εATE

OLS/LR-1      5.8 ± .3    .94 ± .06
OLS/LR-2      2.5 ± .1    .31 ± .02
BLR           5.8 ± .3    .93 ± .05
k-NN          4.1 ± .2    .79 ± .05
BART          2.3 ± .1    .34 ± .02
RANDFOR       6.6 ± .3    .96 ± .06
CAUSFOR       3.8 ± .2    .40 ± .03
BNN           2.1 ± .1    .42 ± .03
TARNET        .95 ± .0    .28 ± .01
CFR MMD       .78 ± .0    .31 ± .01
CFR WASS      .76 ± .0    .27 ± .01

Out-of-sample IHDP 1000 replications


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though the Related Work section presented publications implementing powerful feature-selection methods, an additional experiment was performed in the developed code, applying Recursive Feature Elimination (RFE) using the scikit-learn library.

Strictly speaking, under the Strong Ignorability assumption made for the studied dataset it would not be appropriate to discard covariates; however, given the sensitivity of the machine-learning regressors to highly correlated input features, eliminating some of them might relieve part of the errors they make.

The results were significantly worse for all algorithms on the causal-inference metrics detailed in this work; they are therefore not shown here, but they can be reviewed in the code implementation for further analysis.
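An RFE pass of this kind takes only a few lines with scikit-learn. The sketch below is an illustration, not the thesis's exact configuration: the base estimator (Ridge), the number of retained features, and the synthetic data with a deliberately collinear column pair are my assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 25))
# Make two columns nearly collinear, mimicking highly correlated covariates
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.normal(size=200)

# Recursively drop the weakest features (by |coefficient|), keeping the top 10
selector = RFE(estimator=Ridge(alpha=1.0), n_features_to_select=10, step=1)
selector.fit(X, y)

kept = np.flatnonzero(selector.support_)
print("kept columns:", kept)
```

RFE requires a base estimator exposing `coef_` or `feature_importances_`, which is why a linear model is used here rather than, say, SVR with an RBF kernel.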

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Network implementation was tested on just 10 replications of the IHDP dataset. The code in the GitHub repository of Dr. Spyros Samothrakis was executed to obtain the 10-replication results reported here; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with those in the previous subsection, the uploaded code is ready to run the full 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel dual-core i7.

Domain Adaptation algorithms are, due to their architectural design, a promising approach to explore for ITE and ATE prediction.

TABLE 4.17: Domain Adaptation Neural Networks

                        εITE           εATE           √εPEHE

DANN (Within-sample)    1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)    1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and Out-of-sample IHDP 10 replications
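The architectural idea that makes DANNs attractive here is gradient reversal: a shared representation is trained to predict the outcome while "unlearning" whatever lets a discriminator tell treated from control units. Below is a minimal numpy sketch of that mechanism, with linear stand-ins for the deep feature extractor and heads; all sizes, rates, and the reversal weight λ are illustrative assumptions, not the configuration used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 256, 25, 8
X = rng.normal(size=(n, p))
t = (rng.random(n) < 0.5).astype(float)           # treatment arm = "domain"
y = X[:, 0] + 2.0 * t + 0.1 * rng.normal(size=n)  # factual outcome

W = 0.1 * rng.normal(size=(p, k))   # shared (linear) feature extractor
v = np.zeros(k)                     # outcome head
u = np.zeros(k)                     # treatment/domain discriminator head
lr, lam = 0.05, 0.1                 # learning rate, reversal strength

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    h = X @ W                        # shared representation
    g_y = 2.0 * ((h @ v) - y) / n    # grad of MSE w.r.t. outcome prediction
    g_d = (sigmoid(h @ u) - t) / n   # grad of BCE w.r.t. discriminator logit

    v -= lr * (h.T @ g_y)            # outcome head: minimise outcome loss
    u -= lr * (h.T @ g_d)            # discriminator: minimise domain loss
    # Gradient reversal: the extractor descends the outcome gradient but
    # ASCENDS the domain gradient, pushing h to be uninformative about t
    W -= lr * (X.T @ np.outer(g_y, v) - lam * (X.T @ np.outer(g_d, u)))

mse = float(np.mean((X @ W @ v - y) ** 2))
print(round(mse, 3))
```

In the real architecture the extractor and both heads are deep networks, but the sign flip on the domain gradient is exactly the mechanism sketched here.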

4.3 Discussion

As can be clearly noticed, the machine-learning regressors applied in this dissertation obtain results very close to those published by the cited and compared authors.

It is remarkable that no custom metric function, no Integral Probability Metric to overcome the unbalanced treatment assignment, and no other custom loss function were needed to obtain the results shown in Table 4.10 and Table 4.9.


The cited authors appear to invest an excessive amount of effort in complicated methods without gaining much more in causal prediction from observational data.

However, they claim that on more unbalanced representations of the feature space or treatment assignment, their methods can overcome this problem much better than out-of-the-box machine-learning algorithms. No metrics, though, are reported on heavily unbalanced treatment-assignment datasets.

Finally, the training and testing errors over 10 replications of the Domain Adaptation Neural Network showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine-learning regressors, with little or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate, custom techniques that constitute the state of the art.

It has to be taken into account that no custom metric functions or special preprocessing steps (except for scaling the features, which is a standard task when using machine-learning algorithms) were required to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.

In addition, this dissertation reports results for machine-learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, although it shifted once the obtained metrics turned out to be almost as good as the state-of-the-art numbers.

It is important to note that machine-learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the deep-learning literature, are potentially more suitable than both plain machine-learning regressors and custom or generalized metric and error functions. Moreover, continuous-space causal problems from observational data with more than two possible outcomes are substantially better suited to Reinforcement Learning algorithms than to any deep neural network or regressor.

Finally, this work is intended to fill a considerably empty space: straightforward definitions for applying machine learning to causality. Although several noticeable papers have been published in the last two years, they are difficult to follow for researchers with a computer- and data-science background trying to relate terms from the causal-inference field. I gave my best to compile, define, explain, detail, and relate causal inference with machine-learning terminology.


5.2 Future work

Future work on this topic should proceed along the following directions.

First, it would be important to apply the machine-learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome should be extended to multi-valued treatments. To the best of my knowledge it should not be costly to make this modification in the code; however, at least one new dataset supporting this kind of treatment would need to be processed.
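One way this binary-to-multi-valued extension could look is to fit one regressor per treatment arm and read pairwise effects off the predicted potential outcomes (a T-learner-style sketch; the dataset, arm count, and effect sizes below are invented for illustration).

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p, n_arms = 300, 10, 3                 # 3 treatment arms instead of 2
X = rng.normal(size=(n, p))
t = rng.integers(0, n_arms, size=n)       # observed treatment per unit
effects = np.array([0.0, 1.0, 3.0])       # ground-truth additive arm effects
y = X[:, 0] + effects[t] + 0.1 * rng.normal(size=n)

# One regressor per arm, fitted on that arm's factual units only
models = {a: Ridge(alpha=1.0).fit(X[t == a], y[t == a])
          for a in range(n_arms)}

# Predicted potential outcomes for every unit under every arm
mu_hat = np.column_stack([models[a].predict(X) for a in range(n_arms)])

# Pairwise ATE of arm 2 vs arm 0, estimated over the whole sample
ate_2_vs_0 = (mu_hat[:, 2] - mu_hat[:, 0]).mean()
print(round(ate_2_vs_0, 2))  # close to 3.0
```

The binary case in this thesis is exactly this scheme with `n_arms = 2`, which is why the extension should be cheap.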

Third, applying these methods to binary factual predictions, with a Policy Risk threshold determining whether a treatment should be applied, would be an important next step in causal inference from observational data. The machine-learning algorithms applied in this work are suitable for testing on such datasets, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the 1000-replication IHDP dataset is a very promising task, due both to the architectural design of the algorithm and to the state-of-the-art precision it showed in the experiment run here.

Lastly, causal datasets whose outcomes vary with time and with the treatments applied are framed as time-series problems in the continuous space. Such datasets will possibly be the next focus of researchers applying machine learning to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela Van Der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822

Atan, Onur, William R. Zame, and Mihaela Van Der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094. URL: http://portal.acm.org/citation.cfm?doid=1610075.1610094

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52. URL: http://link.springer.com/10.1007/978-3-319-46487-9_52

Gelman, Andrew and Jennifer Hill (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162. URL: http://www.tandfonline.com/doi/abs/10.1198/jcgs.2010.08162

Hoiles, William and Mihaela Van Der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803. URL: http://doi.acm.org/10.1145/2911451.2914803

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247. URL: http://www.nature.com/articles/nmeth0410-247

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373. URL: http://www.ncbi.nlm.nih.gov/pubmed/23025434

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748. URL: https://www.jstor.org/stable/2529748?origin=crossref

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6. URL: https://www.sciencedirect.com/science/article/pii/0270025586900886

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1. URL: http://link.springer.com/10.1007/978-1-4757-3692-2_1


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880. URL: https://doi.org/10.1198/016214504000001880

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for "Estimating individual treatment effect: generalization bounds and algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction (Complete Draft). Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443. URL: http://www.ncbi.nlm.nih.gov/pubmed/25729117

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf

User guide: contents – scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.


Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf

32 Bibliography

Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie Vincent (2016) NPCI Non-parametrics for Causal Inference URL https githubcomvdorienpci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer Patrik O et al Nonlinear causal discovery with additive noise models Tech repURL https is tuebingen mpg de fileadmin user _ upload files publicationsNIPS2008-Hoyer-neu_5406[0]pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww

Bibliography 33

Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos Christos et al (2017) ldquoCausal Effect Inference with Deep Latent-VariableModels arXiv 1705 08821v2 [ stat ML ] 6 Nov 2017rdquo In Nips arXiv arXiv170508821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1

34 Bibliography

Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Complete Draft Tech rep URLhttpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

ndash (2015b) The Self-Normalized Estimator for Counterfactual LearningTekin Cem and Mihaela Van Der Schaar (2018) Episodic Multi-armed Bandits Tech

rep arXiv arXiv150800641v4 URL httpsarxivorgpdf150800641pdfTian Lu et al (2014) ldquoA Simple Method for Estimating Interactions between a Treat-

ment and a Large Number of Covariatesrdquo In Journal of the American Statistical As-sociation 109508 pp 1517ndash1532 ISSN 0162-1459 DOI 101080016214592014951443 URL httpwwwncbinlmnihgovpubmed25729117httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4338439

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep

  • Declaration of Authorship
  • Abstract
  • Acknowledgements
  • Introduction
    • Motivation
    • Purpose and Research Question
    • Approach and Methodology
    • Scope and Limitation
      • Background
        • Rubin-Newman Causal Model
          • The fundamental problem of causal analysis
          • Metrics for Causality
          • Assumptions
          • Definitions
          • Related Work
            • Machine Learning
              • Ordinary Least Squares (Linear Regression)
              • Ridge Regression
              • Support Vector Regressor
              • Bayesian Ridge
              • Lasso
              • Lasso Lars
              • ARD Regression
              • Passive Aggressive Regressor
              • Theil Sen Regressor
              • K-Neighbors Regressor
              • Logistic Regression
                  • Methodology
                    • Dataset
                    • IHDP dataset
                    • Other articles metrics
                      • Experiments
                        • Machine learning methods applied to IHDP dataset
                        • Other experiments
                          • Recursive Feature Elimination
                          • Domain Adaptation Neural Networks
                            • Discussion
                              • Conclusions
                                • Concluding Remarks
                                • Future work
                                  • Bibliography
Page 34: Machine Learning for causal Inference on Observational Datarepository.essex.ac.uk/24772/1/HernanBorre_Master_Thesis_2018v3.… · School of Computer Science and Electronic Engineering


4.2 Other experiments

4.2.1 Recursive Feature Elimination

Even though powerful feature selection methods and their published implementations were presented in the Related Work section, an additional experiment was performed in the developed code: machine learning with Recursive Feature Elimination (RFE), using the scikit-learn library framework.

Strictly speaking, since Strong Ignorability is assumed for the dataset studied, it would not be appropriate to perform such an experiment; however, given that the machine learning regressors are sensitive to highly correlated input features, eliminating features might relieve some of the errors they make.

The results were significantly worse for all algorithms on the causal inference metrics detailed in this work, so they are not shown here; the reader can reproduce them from the code implementation for further analysis.
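For reference, the idea behind RFE can be sketched in a few lines. This is a minimal, self-contained NumPy illustration built around ordinary least squares, not the scikit-learn `RFE` implementation used in the experiments, and it assumes the features are on comparable scales so that coefficient magnitudes can be compared:

```python
import numpy as np

def rfe_ols(X, y, n_keep):
    """Minimal Recursive Feature Elimination sketch: repeatedly fit an
    ordinary-least-squares model and drop the feature whose coefficient has
    the smallest absolute value, until only n_keep features remain."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        coef, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
        keep.pop(int(np.argmin(np.abs(coef))))  # drop the weakest feature
    return keep

# Toy data: the outcome depends on features 0 and 2 only; feature 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.01 * rng.normal(size=200)
print(rfe_ols(X, y, n_keep=2))  # keeps features 0 and 2
```

scikit-learn's `RFE` generalizes this loop to any estimator exposing `coef_` or `feature_importances_`.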

4.2.2 Domain Adaptation Neural Networks

A Domain Adaptation Neural Networks implementation was tested on just 10 replications of the IHDP dataset. The results for these 10 replications were obtained by running the code in the GitHub repository of Dr. Spyros Samothrakis; the code uploaded with this work contains a straightforward implementation for the 1000 replications used in the other experiments.

The results shown below in Table 4.17 are clearly promising. Although they are not directly comparable with the ones in the previous subsection, the uploaded code is ready to run the 1000 replications on a GPU-powered machine; on CPU, the estimated completion time was about four and a half days on an Intel dual-core i7.

Domain Adaptation algorithms are a promising avenue for exploring ITE and ATE predictions due to their architectural design.

TABLE 4.17: Domain Adaptation Neural Networks

                          εITE           εATE           √εPEHE
DANN (Within-sample)   1.18 ± 0.17    0.12 ± 0.04    1.02 ± 0.48
DANN (Out-of-sample)   1.20 ± 0.11    0.17 ± 0.08    0.76 ± 0.23

Within-sample and Out-of-sample, IHDP 10 replications
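For clarity, the evaluation metrics behind numbers like these can be sketched as follows. This is an illustrative NumPy version, possible because the semi-simulated IHDP benchmark exposes both true potential outcomes; εATE is the absolute error on the average treatment effect and √εPEHE is the root-mean-squared error on the individual treatment effects (variable names are illustrative):

```python
import numpy as np

def causal_errors(y1_true, y0_true, y1_pred, y0_pred):
    """Error metrics computable when the true potential outcomes are known,
    as in the semi-simulated IHDP benchmark. The ITE of unit i is
    y1[i] - y0[i]; the ATE is its mean over the sample."""
    ite_true = y1_true - y0_true
    ite_pred = y1_pred - y0_pred
    eps_ate = abs(ite_true.mean() - ite_pred.mean())              # εATE
    sqrt_eps_pehe = np.sqrt(np.mean((ite_true - ite_pred) ** 2))  # √εPEHE
    return eps_ate, sqrt_eps_pehe

# Toy example: an unbiased but per-unit noisy prediction of y1.
rng = np.random.default_rng(1)
y0 = rng.normal(size=500)
y1 = y0 + 4.0  # the true ITE is exactly 4 for every unit
y1_hat = y1 + rng.normal(scale=0.5, size=500)
eps_ate, sqrt_eps_pehe = causal_errors(y1, y0, y1_hat, y0)
# eps_ate is small (per-unit errors average out); sqrt_eps_pehe stays near 0.5
```

This also illustrates why εATE can be much smaller than √εPEHE, as in Table 4.17: unbiased per-unit errors cancel in the average but not in the squared individual differences.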

4.3 Discussion

As can be clearly noticed, the results of the machine learning regressor algorithms applied in this dissertation are very close to those obtained in the work published by the cited and compared authors.

It is remarkable that no custom metric function (such as an Integral Probability Metric to overcome the treatment imbalance of the dataset) nor any other custom loss function was applied to obtain the results shown in Table 4.10 and Table 4.9.

28 Chapter 4 Experiments

It seems that an excessive amount of effort by those authors leads to complicated methods without gaining much more in causality prediction from observational data.

However, they claim that with more unbalanced representations of the feature space or of the treatment assignment, their methods can overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment assignment datasets.

Finally, the training and testing errors of the Domain Adaptation Neural Networks over 10 replications showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes of a given unit are close to those obtained by the more elaborate and custom techniques that constitute the state-of-the-art performance results.

It has to be taken into account that no custom metric functions nor special preprocessing steps (except for scaling the features, which is one of the must-do tasks when using machine learning algorithms) were performed to achieve results similar to the state-of-the-art metrics reported in the mentioned and compared papers.
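As an illustration of how little machinery this involves, the "scale the features, then fit a plain regressor" recipe can be sketched as below. This is a self-contained NumPy sketch using a closed-form ridge regressor, not the exact scikit-learn pipeline used in the experiments:

```python
import numpy as np

def fit_scaled_ridge(X, y, alpha=1.0):
    """Standardize features to zero mean / unit variance, then fit ridge
    regression in closed form: w = (Xs'Xs + alpha*I)^-1 Xs'(y - mean(y))."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    y_mu = y.mean()
    Xs = (X - mu) / sigma
    w = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(X.shape[1]),
                        Xs.T @ (y - y_mu))
    def predict(X_new):
        # Reuse the training statistics when scaling new data.
        return ((X_new - mu) / sigma) @ w + y_mu
    return predict

# Features on wildly different scales: scaling keeps the regressor stable.
rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=[1.0, 100.0], size=(300, 2))
y = X @ np.array([2.0, 0.03])
predict = fit_scaled_ridge(X, y)
```

The important detail, mirrored in the scikit-learn pipelines, is that the scaler is fit on the training data only and its statistics are reused at prediction time.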

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted benchmark IHDP dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE and PEHE error calculations. This was the main goal of this thesis, although it shifted once the obtained metrics turned out to be almost as close as the state-of-the-art numbers.

It is important to note that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the deep learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal problems from observational data, involving more than two possible outcomes, that are substantially better suited to Reinforcement Learning algorithms than to any other deep neural network or regressor.

Finally, this work is intended to fill a considerably empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers have been published in the last two years, they are difficult to follow for researchers with a computer and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail and relate causal inference with machine learning terminology.


5.2 Future work

Future directions for this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to a multi-valued treatment. To the best of my knowledge it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this kind of treatment arity would need to be processed.

Third, applying this method to perform binary factual predictions, with a Policy Risk threshold for deciding whether a treatment should be applied or not, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing with this type of dataset, solving a common real-life problem in the field.

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replication dataset is a very promising task, due both to the architectural design of the algorithm and to the precision, outperforming the state of the art, that it showed in the experiment run here.

Lastly, the application of these methods to causal datasets whose outcomes vary with time and with the treatments applied is framed within time series problems in the continuous space. These kinds of datasets will possibly be the next focus of researchers applying machine learning to treatments applied over time.


Bibliography

Alaa Ahmed M Michael Weisz and Mihaela Van Der Schaar (2017) Deep Counter-factual Networks with Propensity-Dropout Tech rep arXiv arXiv170605966v1URL httpsarxivorgpdf170605966pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf


Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie Vincent (2016) NPCI Non-parametrics for Causal Inference URL https githubcomvdorienpci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer Patrik O et al Nonlinear causal discovery with additive noise models Tech repURL https is tuebingen mpg de fileadmin user _ upload files publicationsNIPS2008-Hoyer-neu_5406[0]pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww


Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos Christos et al (2017) ldquoCausal Effect Inference with Deep Latent-VariableModels arXiv 1705 08821v2 [ stat ML ] 6 Nov 2017rdquo In Nips arXiv arXiv170508821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1


Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Reinforcement Learning: An Introduction (complete draft) Tech rep URL httpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

– (2015b) The Self-Normalized Estimator for Counterfactual Learning

Tekin Cem and Mihaela Van Der Schaar (2018) Episodic Multi-armed Bandits Tech rep arXiv arXiv150800641v4 URL httpsarxivorgpdf150800641pdf

Tian Lu et al (2014) "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates" In Journal of the American Statistical Association 109508 pp 1517–1532 ISSN 0162-1459 DOI 101080016214592014951443 URL httpwwwncbinlmnihgovpubmed25729117httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4338439

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep

• Declaration of Authorship
• Abstract
• Acknowledgements
• Introduction
  • Motivation
  • Purpose and Research Question
  • Approach and Methodology
  • Scope and Limitation
• Background
  • Rubin-Newman Causal Model
    • The fundamental problem of causal analysis
  • Metrics for Causality
  • Assumptions
  • Definitions
  • Related Work
  • Machine Learning
    • Ordinary Least Squares (Linear Regression)
    • Ridge Regression
    • Support Vector Regressor
    • Bayesian Ridge
    • Lasso
    • Lasso Lars
    • ARD Regression
    • Passive Aggressive Regressor
    • Theil Sen Regressor
    • K-Neighbors Regressor
    • Logistic Regression
• Methodology
  • Dataset
  • IHDP dataset
  • Other articles metrics
• Experiments
  • Machine learning methods applied to IHDP dataset
  • Other experiments
    • Recursive Feature Elimination
    • Domain Adaptation Neural Networks
  • Discussion
• Conclusions
  • Concluding Remarks
  • Future work
• Bibliography
28 Chapter 4 Experiments

The authors appear to invest an excessive amount of effort in complicated methods that gain little additional accuracy in causal prediction from observational data.

However, they claim that when the representation of the feature space or the treatment assignment is more unbalanced, their methods overcome this problem much better than out-of-the-box machine learning algorithms. No other metrics are reported on heavily unbalanced treatment-assignment datasets.

Finally, the training and testing errors over the 10 replications of the Domain Adaptation Neural Networks showed promising results that need to be addressed in future work.


Chapter 5

Conclusions

5.1 Concluding Remarks

The results obtained by applying out-of-the-box machine learning regressors, with significantly less or no added complexity, to predict the factual and counterfactual outcomes for a given unit are close to those obtained by the more elaborate, custom techniques that constitute the state of the art.

It should be noted that no custom metric functions or special preprocessing steps (other than scaling the features, which is a standard task when using machine learning algorithms) were needed to achieve results similar to the state-of-the-art metrics reported in the papers compared above.
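As an illustration of this point, the entire out-of-the-box recipe, scaling plus a stock scikit-learn regressor with the treatment appended as one extra feature, fits in a few lines. The data below are synthetic stand-ins for the IHDP arrays, and all names are placeholders, not the thesis code:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 25))                   # covariates (IHDP has 25)
t = rng.integers(0, 2, size=500)                 # binary treatment assignment
y_f = X[:, 0] + 4.0 * t + rng.normal(size=500)   # factual outcome (synthetic, true effect = 4)

# Scaling is the only preprocessing; the treatment enters as one extra feature.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(np.column_stack([X, t]), y_f)

# Predict both potential outcomes per unit by toggling the treatment flag.
y0_hat = model.predict(np.column_stack([X, np.zeros_like(t)]))
y1_hat = model.predict(np.column_stack([X, np.ones_like(t)]))
ite_hat = y1_hat - y0_hat                        # per-unit treatment effect estimate
```

The same pattern works unchanged for any of the regressors listed in the Background chapter, since only the final pipeline step varies.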

In addition, this dissertation reports results for machine learning techniques not previously applied to the adopted IHDP benchmark dataset, predicting both the factual and counterfactual outcomes and then presenting the ITE, ATE, and PEHE error calculations. This was the main goal of this thesis, although it evolved once the obtained metrics came almost as close as the state-of-the-art numbers.
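For reference, the error measures named here follow the standard definitions used in the IHDP literature (e.g. Hill, 2011): the ATE error is the absolute difference between the true and predicted average effects, and PEHE is the root mean squared error over individual effects. A minimal sketch, with illustrative numbers:

```python
import numpy as np

def ate_error(tau_true, tau_pred):
    """Absolute error on the Average Treatment Effect."""
    return abs(np.mean(tau_true) - np.mean(tau_pred))

def pehe(tau_true, tau_pred):
    """Precision in Estimation of Heterogeneous Effect: root mean squared
    error between true and predicted Individual Treatment Effects."""
    return np.sqrt(np.mean((np.asarray(tau_true) - np.asarray(tau_pred)) ** 2))

tau_true = np.array([4.0, 4.0, 4.0, 4.0])
tau_pred = np.array([3.0, 5.0, 4.0, 4.0])
print(ate_error(tau_true, tau_pred))  # 0.0
print(pehe(tau_true, tau_pred))       # ~0.707
```

Note that a method can have zero ATE error while still scoring poorly on PEHE, as in this example, which is why both are reported.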

It is important to note that machine learning techniques introduced in recent years, such as Domain Adaptation Neural Networks and other methods from the Deep Learning literature, are potentially more suitable than both plain machine learning regressors and custom or generalized metric and error functions. Moreover, there are continuous-space causal problems from observational data, involving more than two possible outcomes, that are substantially better suited to Reinforcement Learning algorithms than to any Deep Neural Network or regressor.

Finally, this work is intended to fill a largely empty space of straightforward definitions for applying machine learning to causality. Although several noticeable papers have been published in the last two years, they are difficult to follow for researchers with a computer science and data science background when relating terms from the causal inference field. I gave my best to compile, define, explain, detail, and relate causal inference with machine learning terminology.


5.2 Future work

Future work on this topic should proceed along five different directions.

First, it would be important to try the applied machine learning methods on other benchmark datasets and compare the results with other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to a multi-valued treatment. To the best of my knowledge it should not be costly to perform such a modification in the code; however, at least one new dataset that supports this number of treatments would need to be processed.
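One plausible shape for that modification, a sketch only and not the thesis code, is to fit one scaled regressor per treatment arm and read pairwise effects off the predicted potential-outcome matrix; all names and the synthetic data are hypothetical:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

def fit_per_arm(X, t, y):
    """Fit one scaled Ridge regressor per treatment value (multi-valued t)."""
    return {arm: make_pipeline(StandardScaler(), Ridge()).fit(X[t == arm], y[t == arm])
            for arm in np.unique(t)}

def predict_all_arms(models, X):
    """Predict every potential outcome; result has shape (n_units, n_arms)."""
    return np.column_stack([models[arm].predict(X) for arm in sorted(models)])

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
t = rng.integers(0, 3, size=300)              # three-valued treatment
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=300)

mu = predict_all_arms(fit_per_arm(X, t, y), X)
ite_2_vs_0 = mu[:, 2] - mu[:, 0]              # pairwise effect between two arms
```

With more than two arms, "the" ITE becomes a matrix of pairwise contrasts, which is why a new benchmark dataset would be needed to validate the extension.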

Third, applying this method to binary factual predictions with a Policy Risk threshold for deciding whether a treatment should be applied would be an important next step in causal inference from observational data. The machine learning algorithms applied in this work are suitable for testing on this type of dataset, solving a common real-life problem in the field.
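The policy-risk evaluation referred to here can be sketched using the formulation of Shalit et al. (2017): treat exactly when the predicted ITE exceeds a threshold, and estimate the policy's value from the units whose observed assignment agrees with the policy. This is an illustrative estimator with hypothetical names, not an implementation from the thesis:

```python
import numpy as np

def policy_risk(ite_hat, t, y, threshold=0.0):
    """1 - estimated value of the policy 'treat iff predicted ITE > threshold',
    scored on the units whose observed treatment agrees with the policy."""
    ite_hat, t, y = map(np.asarray, (ite_hat, t, y))
    pi = (ite_hat > threshold).astype(int)       # policy's treatment decision
    treat = (pi == 1) & (t == 1)                 # policy and data agree: treated
    control = (pi == 0) & (t == 0)               # policy and data agree: untreated
    p_treat = pi.mean()
    value = (y[treat].mean() if treat.any() else 0.0) * p_treat \
          + (y[control].mean() if control.any() else 0.0) * (1.0 - p_treat)
    return 1.0 - value

ite_hat = np.array([1.0, -1.0, 2.0, -2.0])
t = np.array([1, 0, 1, 0])
y = np.array([1.0, 1.0, 0.0, 0.0])
print(policy_risk(ite_hat, t, y))  # 0.5
```

Lower is better: a policy whose agreeing units all had good outcomes approaches a risk of zero.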

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replications dataset is a very promising task, owing both to the architectural design of the algorithm and to the state-of-the-art precision it achieved in the experiment run here.

Lastly, applying these methods to causal datasets whose outcomes vary with time and with the treatments applied frames the problem as a time-series problem in continuous space. Such datasets will possibly be the next focus of researchers applying machine learning to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela Van Der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf.

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822.

Atan, Onur, William R. Zame, and Mihaela Van Der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5.

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–60. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113. URL: http://www.ncbi.nlm.nih.gov/pubmed/27382149.

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf.

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786. URL: http://www.ncbi.nlm.nih.gov/pubmed/21818162.

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". In: DOI: 10.1111/j.1541-0420.2005.00377.x.

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation.

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044.

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing - EMNLP '06. Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094.

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf.


Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060.

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285. URL: http://projecteuclid.org/euclid.aoas/1273584455.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815.

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci.

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf.

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601.

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela Van Der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803.

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www.


Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". In: (). URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227. URL: http://projecteuclid.org/euclid.ss/1207580167.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820. URL: http://projecteuclid.org/euclid.aos/1211819571.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period - application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1.


Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for: Estimating individual treatment effect: generalization bounds and algorithms. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Complete Draft. Tech. rep. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela Van Der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents - scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: http://arxiv.org/abs/1405.0352.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.


User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep

  • Declaration of Authorship
  • Abstract
  • Acknowledgements
  • Introduction
    • Motivation
    • Purpose and Research Question
    • Approach and Methodology
    • Scope and Limitation
  • Background
    • Rubin-Neyman Causal Model
      • The fundamental problem of causal analysis
    • Metrics for Causality
    • Assumptions
    • Definitions
    • Related Work
  • Machine Learning
    • Ordinary Least Squares (Linear Regression)
    • Ridge Regression
    • Support Vector Regressor
    • Bayesian Ridge
    • Lasso
    • Lasso Lars
    • ARD Regression
    • Passive Aggressive Regressor
    • Theil Sen Regressor
    • K-Neighbors Regressor
    • Logistic Regression
  • Methodology
    • Dataset
    • IHDP dataset
    • Other articles metrics
  • Experiments
    • Machine learning methods applied to IHDP dataset
    • Other experiments
      • Recursive Feature Elimination
      • Domain Adaptation Neural Networks
  • Discussion
  • Conclusions
    • Concluding Remarks
    • Future work
  • Bibliography
30 Chapter 5 Conclusions

5.2 Future work

Future directions for this work include five different approaches that should be taken.

First, it would be important to apply the machine learning methods used here to other benchmark datasets and to compare the results with those of other published papers and algorithms of the same or greater complexity.

Second, the functionality developed for a binary treatment and a continuous outcome could be extended to a multi-valued treatment. To the best of my knowledge, it should not be costly to make such a modification in the code; however, at least one new dataset that supports this kind of treatment would need to be processed.
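As a rough sketch of how this extension could look (plain least squares stands in here for the scikit-learn regressors used in this work, and all function names are illustrative), the binary two-model setup generalises directly to K treatment arms by fitting one outcome model per arm:

```python
import numpy as np

def fit_arm(X, y):
    # Least-squares outcome model with an intercept term, for one arm.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_arm(coef, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

def fit_multi_arm(X, t, y):
    # One model per observed treatment arm (T-learner style).
    return {arm: fit_arm(X[t == arm], y[t == arm]) for arm in np.unique(t)}

def potential_outcomes(models, X):
    # One column of predicted potential outcomes per treatment arm.
    return np.column_stack([predict_arm(models[a], X) for a in sorted(models)])

# Toy data: a 3-valued treatment that adds 2.0 to the outcome per arm.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
t = rng.integers(0, 3, size=300)
y = X[:, 0] + 2.0 * t + rng.normal(scale=0.1, size=300)
po = potential_outcomes(fit_multi_arm(X, t, y), X)
print(po.shape)  # (300, 3)
```

The estimated effect of arm k versus arm 0 is then the mean of `po[:, k] - po[:, 0]`; on the toy data above this recovers roughly 2.0 per arm step.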

Third, applying these methods to make binary factual predictions, together with a Policy Risk threshold for deciding whether a treatment should be applied, would be an important next step for causal inference from observational data. The machine learning algorithms applied in this work are well suited to testing on this type of dataset, which addresses a common real-life problem in the field.
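For concreteness, Policy Risk can be computed from randomised data as one minus the expected factual outcome among units whose received treatment matches the policy's recommendation. The following numpy sketch follows the definition popularised by Shalit, Johansson and Sontag (2017); the function name and array layout are my own:

```python
import numpy as np

def policy_risk(pi, t, y):
    """pi: the policy's treat/don't-treat decision per unit; t: the
    treatment actually received; y: the observed factual outcome,
    higher is better. Valid when t was assigned at random, as in the
    randomised subsets of the usual benchmark datasets."""
    pi, t, y = (np.asarray(a) for a in (pi, t, y))
    value = 0.0
    treated = (pi == 1) & (t == 1)
    control = (pi == 0) & (t == 0)
    if treated.any():
        value += y[treated].mean() * (pi == 1).mean()
    if control.any():
        value += y[control].mean() * (pi == 0).mean()
    return float(1.0 - value)

# A policy whose recommendations line up with the good factual
# outcomes scores the minimum risk of zero.
print(policy_risk([1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1]))  # 0.0
```

Thresholding a model's predicted treatment effect then yields `pi`, and the threshold can be tuned to minimise this risk on held-out data.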

Fourth, implementing Domain Adaptation Neural Networks on the IHDP 1000-replications dataset is a very promising task, owing both to the architectural design of the algorithm and to the state-of-the-art precision it achieved in the experiment that was run.

Lastly, causal datasets whose outcomes vary with time and with the treatments applied fall within time-series problems in the continuous space. These kinds of datasets will likely be the next focus for researchers applying machine learning to treatments administered over time.


Bibliography

Alaa, Ahmed M., Michael Weisz, and Mihaela van der Schaar (2017). Deep Counterfactual Networks with Propensity-Dropout. Tech. rep. arXiv: 1706.05966v1. URL: https://arxiv.org/pdf/1706.05966.pdf.

Arjas, Elja and Jan Parner. Causal Reasoning from Longitudinal Data. DOI: 10.2307/4616822. URL: https://www.jstor.org/stable/4616822.

Atan, Onur, William R. Zame, and Mihaela van der Schaar (2018). Counterfactual Policy Optimization Using Domain-Adversarial Neural Networks. Tech. rep. URL: http://medianetlab.ee.ucla.edu/papers/cf_treat_v5.

Atan, Onur et al. (2016). Constructing Effective Personalized Policies Using Counterfactual Inference from Biased Data Sets with Many Features. Tech. rep. arXiv: 1612.08082v1.

Athey, Susan and Guido Imbens (2016). "Recursive partitioning for heterogeneous causal effects". In: Proceedings of the National Academy of Sciences of the United States of America 113.27, pp. 7353–7360. ISSN: 1091-6490. DOI: 10.1073/pnas.1510489113.

Athey, Susan and Guido W. Imbens. Recursive Partitioning for Heterogeneous Causal Effects. Tech. rep. arXiv: 1504.01132v3. URL: https://arxiv.org/pdf/1504.01132.pdf.

Austin, Peter C. (2011). "An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies". In: Multivariate Behavioral Research 46.3, pp. 399–424. ISSN: 1532-7906. DOI: 10.1080/00273171.2011.568786.

Bang, Heejung and James M. Robins (2005). "Doubly Robust Estimation in Missing Data and Causal Inference Models". DOI: 10.1111/j.1541-0420.2005.00377.x.

Ben-David, Shai et al. (2007). Analysis of Representations for Domain Adaptation. URL: https://papers.nips.cc/paper/2983-analysis-of-representations-for-domain-adaptation.

Beygelzimer, Alina and John Langford (2008). "The Offset Tree for Learning with Partial Labels". In: arXiv: 0812.4044. URL: http://arxiv.org/abs/0812.4044.

Blitzer, John, Ryan McDonald, and Fernando Pereira (2006). "Domain adaptation with structural correspondence learning". In: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP '06). Morristown, NJ, USA: Association for Computational Linguistics, p. 120. ISBN: 1932432736. DOI: 10.3115/1610075.1610094.

Bottou, Léon et al. (2013). Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Tech. rep., pp. 3207–3260. URL: https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/bottou13a.pdf.

Chernozhukov, Victor et al. (2016). "Double/Debiased Machine Learning for Treatment and Causal Parameters". In: arXiv: 1608.00060. URL: http://arxiv.org/abs/1608.00060.

Chipman, Hugh A., Edward I. George, and Robert E. McCulloch (2010). "BART: Bayesian additive regression trees". In: The Annals of Applied Statistics 4.1, pp. 266–298. ISSN: 1932-6157. DOI: 10.1214/09-AOAS285.

Daumé, Hal (2009). "Frustratingly Easy Domain Adaptation". In: arXiv: 0907.1815. URL: http://arxiv.org/abs/0907.1815.

Dorie, Vincent (2016). NPCI: Non-parametrics for Causal Inference. URL: https://github.com/vdorie/npci.

Doroudi, Shayan, Philip S. Thomas, and Emma Brunskill (2017). Importance Sampling for Fair Policy Selection. Tech. rep. URL: https://www.ijcai.org/proceedings/2018/0729.pdf.

Dudik, Miroslav, John Langford, and Lihong Li (2011). "Doubly Robust Policy Evaluation and Learning". In: arXiv: 1103.4601. URL: http://arxiv.org/abs/1103.4601.

Gan, Chuang et al. (2016). "Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames". In: pp. 849–866. DOI: 10.1007/978-3-319-46487-9_52.

Gelman, Andrew and Jennifer Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, p. 625. ISBN: 9780521686891.

Gross, Ruth T. (1993). Infant Health and Development Program (IHDP): Enhancing the Outcomes of Low Birth Weight, Premature Infants in the United States, 1985-1988. DOI: 10.3886/ICPSR09795.v1.

Hill, Jennifer L. (2011). "Bayesian Nonparametric Modeling for Causal Inference". In: Journal of Computational and Graphical Statistics 20.1, pp. 217–240. ISSN: 1061-8600. DOI: 10.1198/jcgs.2010.08162.

Hoiles, William and Mihaela van der Schaar (2016). "Bounded Off-policy Evaluation with Missing Data for Course Recommendation and Curriculum Design". In: ICML'16, pp. 1596–1604. URL: http://dl.acm.org/citation.cfm?id=3045390.3045559.

Hoyer, Patrik O. et al. Nonlinear causal discovery with additive noise models. Tech. rep. URL: https://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/NIPS2008-Hoyer-neu_5406[0].pdf.

Jiang, Nan and Lihong Li (2015). "Doubly Robust Off-policy Value Evaluation for Reinforcement Learning". In: arXiv: 1511.03722. URL: http://arxiv.org/abs/1511.03722.

Joachims, Thorsten and Adith Swaminathan (2016). "Counterfactual Evaluation and Learning for Search, Recommendation and Ad Placement". In: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '16. Pisa, Italy: ACM, pp. 1199–1201. ISBN: 978-1-4503-4069-4. DOI: 10.1145/2911451.2914803.

Johansson, Fredrik D. (2017, accessed July 19, 2018). Fredrik D. Johansson, PhD - MIT Personal Website. URL: https://stuff.mit.edu/afs/athena.mit.edu/user/f/r/fredrikj/www.

Johansson, Fredrik D., Uri Shalit, and David Sontag. "Learning Representations for Counterfactual Inference". URL: https://people.csail.mit.edu/dsontag/papers/JohanssonShalitSontag_icml16.pdf.

– (2016a). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

– (2016b). Learning Representations for Counterfactual Inference. Tech. rep. URL: http://proceedings.mlr.press/v48/johansson16.pdf.

Kang, Joseph D. Y. and Joseph L. Schafer (2007). "Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data". In: Statistical Science 22.4, pp. 523–539. ISSN: 0883-4237. DOI: 10.1214/07-STS227.

Lang, Ken (1995). "NewsWeeder: Learning to Filter Netnews". In: Machine Learning Proceedings 1995. Elsevier, pp. 331–339. DOI: 10.1016/B978-1-55860-377-6.50048-7.

Lok, Judith J. (2008). "Statistical modeling of causal effects in continuous time". In: The Annals of Statistics 36.3, pp. 1464–1507. ISSN: 0090-5364. DOI: 10.1214/009053607000000820.

Louizos, Christos et al. (2017). "Causal Effect Inference with Deep Latent-Variable Models". In: NIPS. arXiv: 1705.08821v2.

Maathuis, Marloes H. et al. (2010). "Predicting causal effects in large-scale systems from observational data". In: Nature Methods 7.4, pp. 247–248. ISSN: 1548-7091. DOI: 10.1038/nmeth0410-247.

Mooij, Joris M. et al. (2016). Distinguishing Cause from Effect Using Observational Data: Methods and Benchmarks. Tech. rep., pp. 1–102. URL: http://jmlr.org/papers/volume17/14-518/14-518.pdf.

Morgan, Stephen L. and Christopher Winship (2014). Counterfactuals and Causal Inference. Cambridge: Cambridge University Press. ISBN: 9781107587991. DOI: 10.1017/CBO9781107587991.

Nahum-Shani, Inbal et al. (2012). "Q-learning: A data analysis method for constructing adaptive interventions". In: Psychological Methods 17.4, pp. 478–494. ISSN: 1939-1463. DOI: 10.1037/a0029373.

Prentice, Ross (1976). "Use of the Logistic Model in Retrospective Studies". In: Biometrics 32.3, p. 599. ISSN: 0006-341X. DOI: 10.2307/2529748.

Paduraru, Cosmin et al. (2012). An Empirical Analysis of Off-policy Learning in Discrete MDPs. Tech. rep., pp. 89–101. URL: http://proceedings.mlr.press/v24/paduraru12a/paduraru12a.pdf.

Robins, James (1986). "A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect". In: Mathematical Modelling 7.9-12, pp. 1393–1512. ISSN: 0270-0255. DOI: 10.1016/0270-0255(86)90088-6.

Rosenbaum, Paul R. (2002). "Observational Studies". In: pp. 1–17. DOI: 10.1007/978-1-4757-3692-2_1.

Rosenbaum, Paul R. and Donald B. Rubin (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Tech. rep. 1, pp. 41–55. URL: http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf.

Rubin, Donald B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. DOI: 10.2307/27590541. URL: https://www.jstor.org/stable/27590541.

Rubin, Donald B. (1978). "Bayesian Inference for Causal Effects: The Role of Randomization". In: The Annals of Statistics 6.1, pp. 34–58. ISSN: 0090-5364. DOI: 10.1214/aos/1176344064.

Rubin, Donald B. (2005). "Causal Inference Using Potential Outcomes". In: Journal of the American Statistical Association 100.469, pp. 322–331. DOI: 10.1198/016214504000001880.

Schulam, Peter and Suchi Saria. Reliable Decision Support using Counterfactual Models. Tech. rep. arXiv: 1703.10651v4. URL: https://arxiv.org/pdf/1703.10651.pdf.

Shalit, Uri, Fredrik D. Johansson, and David Sontag (2017). Supplemental Materials for "Estimating individual treatment effect: generalization bounds and algorithms": A. Proofs. Tech. rep. URL: http://proceedings.mlr.press/v70/shalit17a/shalit17a-supp.pdf.

Shalit, Uri and David Sontag (2016). "Causal Inference for Observational Studies". URL: https://cs.nyu.edu/~shalit/slides.pdf.

Sutton, Richard S. and Andrew G. Barto (2017). Reinforcement Learning: An Introduction. Complete draft. URL: http://incompleteideas.net/book/bookdraft2017nov5.pdf.

Swaminathan, Adith and Thorsten Joachims (2015a). Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. URL: http://proceedings.mlr.press/v37/swaminathan15.html.

– (2015b). The Self-Normalized Estimator for Counterfactual Learning.

Tekin, Cem and Mihaela van der Schaar (2018). Episodic Multi-armed Bandits. Tech. rep. arXiv: 1508.00641v4. URL: https://arxiv.org/pdf/1508.00641.pdf.

Tian, Lu et al. (2014). "A Simple Method for Estimating Interactions between a Treatment and a Large Number of Covariates". In: Journal of the American Statistical Association 109.508, pp. 1517–1532. ISSN: 0162-1459. DOI: 10.1080/01621459.2014.951443.

Triantafillou, Sofia and Ioannis Tsamardinos (2015). Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets. Tech. rep., pp. 2147–2205. URL: http://jmlr.org/papers/volume16/triantafillou15a/triantafillou15a.pdf.

User guide: contents — scikit-learn 0.19.2 documentation. URL: http://scikit-learn.org/stable/user_guide.html (visited on 08/17/2018).

Wager, Stefan and Susan Athey (2015). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". In: arXiv: 1510.04342. URL: http://arxiv.org/abs/1510.04342.

– (2017). Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Tech. rep. arXiv: 1510.04342v4. URL: https://arxiv.org/abs/1510.04342.

Zhang, Kun et al. (2013). Domain Adaptation under Target and Conditional Shift. URL: http://proceedings.mlr.press/v28/zhang13d.html.

Zhao, Siyuan and Neil Heffernan. Estimating Individual Treatment Effect from Educational Studies with Residual Counterfactual Networks. Tech. rep.

  • Declaration of Authorship
  • Abstract
  • Acknowledgements
  • Introduction
    • Motivation
    • Purpose and Research Question
    • Approach and Methodology
    • Scope and Limitation
      • Background
        • Rubin-Newman Causal Model
          • The fundamental problem of causal analysis
          • Metrics for Causality
          • Assumptions
          • Definitions
          • Related Work
            • Machine Learning
              • Ordinary Least Squares (Linear Regression)
              • Ridge Regression
              • Support Vector Regressor
              • Bayesian Ridge
              • Lasso
              • Lasso Lars
              • ARD Regression
              • Passive Aggressive Regressor
              • Theil Sen Regressor
              • K-Neighbors Regressor
              • Logistic Regression
                  • Methodology
                    • Dataset
                    • IHDP dataset
                    • Other articles metrics
                      • Experiments
                        • Machine learning methods applied to IHDP dataset
                        • Other experiments
                          • Recursive Feature Elimination
                          • Domain Adaptation Neural Networks
                            • Discussion
                              • Conclusions
                                • Concluding Remarks
                                • Future work
                                  • Bibliography
Page 38: Machine Learning for causal Inference on Observational Datarepository.essex.ac.uk/24772/1/HernanBorre_Master_Thesis_2018v3.… · School of Computer Science and Electronic Engineering

31

Bibliography

Alaa Ahmed M Michael Weisz and Mihaela Van Der Schaar (2017) Deep Counter-factual Networks with Propensity-Dropout Tech rep arXiv arXiv170605966v1URL httpsarxivorgpdf170605966pdf

Arjas Elja and Jan Parner Causal Reasoning from Longitudinal Data DOI 1023074616822 URL httpswwwjstororgstable4616822

Atan Onur William R Zame and Mihaela Van Der Schaar (2018) Counterfactual Pol-icy Optimization Using Domain-Adversarial Neural Networks Tech rep URL httpmedianetlabeeuclaedupaperscf_treat_v5

Atan Onur et al (2016) Constructing Effective Personalized Policies Using Counterfac-tual Inference from Biased Data Sets with Many Features Tech rep arXiv arXiv161208082v1

Athey Susan and Guido Imbens (2016) ldquoRecursive partitioning for heterogeneouscausal effectsrdquo In Proceedings of the National Academy of Sciences of the United Statesof America 11327 pp 7353ndash60 ISSN 1091-6490 DOI 101073pnas1510489113URL httpwwwncbinlmnihgovpubmed27382149httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4941430

Athey Susan and Guido W Imbens Recursive Partitioning for Heterogeneous CausalEffects Tech rep arXiv arXiv150401132v3 URL httpsarxivorgpdf150401132pdf

Austin Peter C (2011) ldquoAn Introduction to Propensity Score Methods for Reduc-ing the Effects of Confounding in Observational Studiesrdquo In Multivariate behav-ioral research 463 pp 399ndash424 ISSN 1532-7906 DOI 101080002731712011568786 URL httpwwwncbinlmnihgovpubmed21818162httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3144483

Bang Heejung and James M Robins (2005) ldquoDoubly Robust Estimation in MissingData and Causal Inference Modelsrdquo In DOI 101111j1541-0420200500377x

Ben-David Shai et al (2007) Analysis of Representations for Domain Adaptation URLhttpspapersnipsccpaper2983-analysis-of-representations-for-domain-adaptation

Beygelzimer Alina and John Langford (2008) ldquoThe Offset Tree for Learning withPartial Labelsrdquo In arXiv 08124044 URL httparxivorgabs08124044

Blitzer John Ryan McDonald and Fernando Pereira (2006) ldquoDomain adaptationwith structural correspondence learningrdquo In Proceedings of the 2006 Conference onEmpirical Methods in Natural Language Processing - EMNLP rsquo06 Morristown NJUSA Association for Computational Linguistics p 120 ISBN 1932432736 DOI10311516100751610094 URL httpportalacmorgcitationcfmdoid=16100751610094

Bottou Leacuteon et al (2013) Counterfactual Reasoning and Learning Systems The Exam-ple of Computational Advertising Tech rep pp 3207ndash3260 URL httpswwwmicrosoftcomen-usresearchwp-contentuploads201311bottou13apdf

32 Bibliography

Chernozhukov Victor et al (2016) ldquoDoubleDebiased Machine Learning for Treat-ment and Causal Parametersrdquo In arXiv 160800060 URL httparxivorgabs160800060

Chipman Hugh A Edward I George and Robert E McCulloch (2010) ldquoBARTBayesian additive regression treesrdquo In The Annals of Applied Statistics 41 pp 266ndash298 ISSN 1932-6157 DOI 10121409-AOAS285 URL httpprojecteuclidorgeuclidaoas1273584455

Daumeacute Hal (2009) ldquoFrustratingly Easy Domain Adaptationrdquo In arXiv 09071815URL httparxivorgabs09071815

Dorie Vincent (2016) NPCI Non-parametrics for Causal Inference URL https githubcomvdorienpci

Doroudi Shayan Philip S Thomas and Emma Brunskill (2017) Importance Samplingfor Fair Policy Selection Tech rep URL httpswwwijcaiorgproceedings20180729pdf

Dudik Miroslav John Langford and Lihong Li (2011) ldquoDoubly Robust Policy Eval-uation and Learningrdquo In arXiv 11034601 URL httparxivorgabs11034601

Gan Chuang et al (2016) ldquoWebly-Supervised Video Recognition by Mutually Vot-ing for Relevant Web Images and Web Video Framesrdquo In pp 849ndash866 DOI 101007978-3-319-46487-9_52 URL httplinkspringercom101007978-3-319-46487-9_52

Gelman Andrew and Jennifer Hill (2007) Data analysis using regression and multi-levelhierarchical models Cambridge University Press p 625 ISBN 9780521686891

Gross Ruth T (1993) Infant Health and Development Program (IHDP) Enhancing theOutcomes of Low Birth Weight Premature Infants in the United States 1985-1988 DOI103886ICPSR09795v1

Hill Jennifer L (2011) ldquoBayesian Nonparametric Modeling for Causal InferencerdquoIn Journal of Computational and Graphical Statistics 201 pp 217ndash240 ISSN 1061-8600 DOI 101198jcgs201008162 URL httpwwwtandfonlinecomdoiabs101198jcgs201008162

Hoiles William and Mihaela Van Der Schaar (2016) ldquoBounded Off-policy Evalua-tion with Missing Data for Course Recommendation and Curriculum Designrdquo InICMLrsquo16 pp 1596ndash1604 URL httpdlacmorgcitationcfmid=30453903045559

Hoyer Patrik O et al Nonlinear causal discovery with additive noise models Tech repURL https is tuebingen mpg de fileadmin user _ upload files publicationsNIPS2008-Hoyer-neu_5406[0]pdf

Jiang Nan and Lihong Li (2015) ldquoDoubly Robust Off-policy Value Evaluation forReinforcement Learningrdquo In arXiv 151103722 URL httparxivorgabs151103722

Joachims Thorsten and Adith Swaminathan (2016) ldquoCounterfactual Evaluation andLearning for Search Recommendation and Ad Placementrdquo In Proceedings of the39th International ACM SIGIR Conference on Research and Development in InformationRetrieval SIGIR rsquo16 Pisa Italy ACM pp 1199ndash1201 ISBN 978-1-4503-4069-4 DOI10114529114512914803 URL httpdoiacmorg10114529114512914803

Johansson Fredrik D (2017 (accessed July 19 2018)) Fredrik D Johansson PhD - MITPersonal Website URL httpsstuffmiteduafsathenamiteduuserfrfredrikjwww

Bibliography 33

Johansson Fredrik D Uri Shalit and David Sontag ldquoLearning Representations forCounterfactual Inferencerdquo In () URL httpspeoplecsailmitedudsontagpapersJohanssonShalitSontag_icml16pdf

ndash (2016a) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

ndash (2016b) Learning Representations for Counterfactual Inference Tech rep URL httpproceedingsmlrpressv48johansson16pdf

Kang Joseph D Y and Joseph L Schafer (2007) ldquoDemystifying Double RobustnessA Comparison of Alternative Strategies for Estimating a Population Mean fromIncomplete Datardquo In Statistical Science 224 pp 523ndash539 ISSN 0883-4237 DOI10121407-STS227 URL httpprojecteuclidorgeuclidss1207580167

Lang Ken (1995) ldquoNewsWeeder Learning to Filter Netnewsrdquo In Machine LearningProceedings 1995 Elsevier pp 331ndash339 DOI 101016B978- 1- 55860- 377- 650048-7

Lok Judith J (2008) ldquoStatistical modeling of causal effects in continuous timerdquoIn The Annals of Statistics 363 pp 1464ndash1507 ISSN 0090-5364 DOI 10 1214 009053607000000820 URL httpprojecteuclidorgeuclidaos1211819571

Louizos Christos et al (2017) ldquoCausal Effect Inference with Deep Latent-VariableModels arXiv 1705 08821v2 [ stat ML ] 6 Nov 2017rdquo In Nips arXiv arXiv170508821v2

Maathuis Marloes H et al (2010) ldquoPredicting causal effects in large-scale systemsfrom observational datardquo In Nature Methods 74 pp 247ndash248 ISSN 1548-7091DOI 101038nmeth0410-247 URL httpwwwncbinlmnihgovpubmed20354511httpwwwnaturecomarticlesnmeth0410-247

Mooij Joris M et al (2016) Distinguishing Cause from Effect Using Observational DataMethods and Benchmarks Tech rep pp 1ndash102 URL httpjmlrorgpapersvolume1714-51814-518pdf

Morgan Stephen L and Christopher Winship (2014) Counterfactuals and Causal In-ference Cambridge Cambridge University Press ISBN 9781107587991 DOI 101017CBO9781107587991

Nahum-Shani Inbal et al (2012) ldquoQ-learning A data analysis method for construct-ing adaptive interventionsrdquo In Psychological Methods 174 pp 478ndash494 ISSN 1939-1463 DOI 101037a0029373 URL httpwwwncbinlmnihgovpubmed23025434httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC3747013httpdoiapaorggetdoicfmdoi=101037a0029373

Prentice Ross (1976) ldquoUse of the Logistic Model in Retrospective Studiesrdquo In Bio-metrics 323 p 599 ISSN 0006341X DOI 1023072529748 URL httpswwwjstororgstable2529748origin=crossref

Paduraru Cosmin et al (2012) An Empirical Analysis of Off-policy Learning in Dis-crete MDPs Tech rep pp 89ndash101 URL httpproceedingsmlrpressv24paduraru12apaduraru12apdf

Robins James (1986) ldquoA new approach to causal inference in mortality studies witha sustained exposure periodmdashapplication to control of the healthy worker sur-vivor effectrdquo In Mathematical Modelling 79-12 pp 1393ndash1512 ISSN 0270-0255DOI 1010160270- 0255(86)90088- 6 URL httpswwwsciencedirectcomsciencearticlepii0270025586900886

Rosenbaum Paul R (2002) ldquoObservational Studiesrdquo In pp 1ndash17 DOI 101007978-1-4757-3692-2_1 URL httplinkspringercom101007978-1-4757-3692-2_1

34 Bibliography

Rosenbaum Paul R and Donald B Rubin (1983) The Central Role of the PropensityScore in Observational Studies for Causal Effects Tech rep 1 pp 41ndash55 URL httpwwwstatcmuedu~ryantibsjournalclubrosenbaum_1983pdf

Rubin Donald B Causal Inference Using Potential Outcomes Design Modeling Deci-sions DOI 10230727590541 URL httpswwwjstororgstable27590541

Rubin Donald B (1978) ldquoBayesian Inference for Causal Effects The Role of Ran-domizationrdquo In The Annals of Statistics 61 pp 34ndash58 ISSN 0090-5364 DOI 101214aos1176344064

Rubin Donald B (2005) ldquoCausal Inference Using Potential Outcomesrdquo In Journal ofthe American Statistical Association 100469 pp 322ndash331 DOI 101198016214504000001880eprint httpsdoiorg101198016214504000001880 URL httpsdoiorg101198016214504000001880

Schulam Peter and Suchi Saria Reliable Decision Support using Counterfactual ModelsTech rep arXiv arXiv170310651v4 URL httpsarxivorgpdf170310651pdf

Shalit Uri Fredrik D Johansson and David Sontag (2017) Supplemental Materials forEstimating individual treatment effect generalization bounds and algorithms A ProofsTech rep URL httpproceedingsmlrpressv70shalit17ashalit17a-supppdf

Shalit Uri and David Sontag (2016) ldquoCAUSAL INFERENCE FOR OBSERVATIONALSTUDIESrdquo In URL httpscsnyuedu~shalitslidespdf

Sutton Richard S and Andrew G Barto (2017) Complete Draft Tech rep URLhttpincompleteideasnetbookbookdraft2017nov5pdf

Swaminathan Adith and Thorsten Joachims (2015a) Counterfactual Risk Minimiza-tion Learning from Logged Bandit Feedback URL httpproceedingsmlrpressv37swaminathan15html

ndash (2015b) The Self-Normalized Estimator for Counterfactual LearningTekin Cem and Mihaela Van Der Schaar (2018) Episodic Multi-armed Bandits Tech

rep arXiv arXiv150800641v4 URL httpsarxivorgpdf150800641pdfTian Lu et al (2014) ldquoA Simple Method for Estimating Interactions between a Treat-

ment and a Large Number of Covariatesrdquo In Journal of the American Statistical As-sociation 109508 pp 1517ndash1532 ISSN 0162-1459 DOI 101080016214592014951443 URL httpwwwncbinlmnihgovpubmed25729117httpwwwpubmedcentralnihgovarticlerenderfcgiartid=PMC4338439

Triantafillou Sofia and Ioannis Tsamardinos (2015) Constraint-based Causal Discoveryfrom Multiple Interventions over Overlapping Variable Sets Tech rep pp 2147ndash2205URL httpjmlrorgpapersvolume16triantafillou15atriantafillou15apdf

User guide contents mdash scikit-learn 0192 documentation URL httpscikit-learnorgstableuser_guidehtml (visited on 08172018)

Wager Stefan and Susan Athey (2015) ldquoEstimation and Inference of HeterogeneousTreatment Effects using Random Forestsrdquo In arXiv 151004342 URL httparxivorgabs151004342

ndash (2017) Estimation and Inference of Heterogeneous Treatment Effects using Random Forests Tech rep arXiv arXiv151004342v4 URL httparxivorgabs14050352

Zhang Kun et al (2013) Domain Adaptation under Target and Conditional Shift URLhttpproceedingsmlrpressv28zhang13dhtml

Zhao Siyuan and Neil Heffernan Estimating Individual Treatment Effect from Educa-tional Studies with Residual Counterfactual Networks Tech rep

• Declaration of Authorship
• Abstract
• Acknowledgements
• Introduction
  • Motivation
  • Purpose and Research Question
  • Approach and Methodology
  • Scope and Limitation
• Background
  • Rubin-Neyman Causal Model
    • The fundamental problem of causal analysis
    • Metrics for Causality
    • Assumptions
    • Definitions
  • Related Work
  • Machine Learning
    • Ordinary Least Squares (Linear Regression)
    • Ridge Regression
    • Support Vector Regressor
    • Bayesian Ridge
    • Lasso
    • Lasso Lars
    • ARD Regression
    • Passive Aggressive Regressor
    • Theil Sen Regressor
    • K-Neighbors Regressor
    • Logistic Regression
• Methodology
  • Dataset
  • IHDP dataset
  • Other articles metrics
• Experiments
  • Machine learning methods applied to IHDP dataset
  • Other experiments
    • Recursive Feature Elimination
    • Domain Adaptation Neural Networks
• Discussion
• Conclusions
  • Concluding Remarks
  • Future work
• Bibliography
