The influence of interpretable machine learning on
human accuracy
A study on the increase in human accuracy from a LIME explainer on a classification
test
Rens Sturm
Master Thesis
MSc Marketing Intelligence
June 11th, 2020
The influence of interpretable machine learning on
human accuracy
A study on the increase in human accuracy from a LIME explainer on a classification
test
By: Rens Sturm
University of Groningen
Faculty of Economics and Business (FEB)
Department: Marketing
Master: Marketing Intelligence
June 2020
First supervisor: K. Dehmamy
Second Supervisor: J. Wierenga
Saffierstraat 22, 9743LH
Groningen 06-83773819
Student number: S3856593
Management Summary
The field of machine learning is growing at an unprecedented rate, increasing its applications in everyday life. A prominent new role of machine learning is the automation or support of decision making for businesses, courts and governments, helping them make faster and better decisions. The rapid expansion of machine learning decision making has caused unease among academics, consumer groups and legal experts. While the reasons for this unease vary, a major one is that while machine learning models make or support decisions, they provide neither an explanation nor supporting arguments. Together with the fact that machine learning models are, like all systems, fallible, this has led to an increasing demand for interpretable machine learning models.
This research tests whether an interpretation mechanism included in a machine learning model increases trust in the model and whether an explanatory mechanism makes the decision-maker more accurate. To do this, we trained a neural network, a type of machine learning model, on a dataset containing the passengers of the Titanic, which sank in 1912. After training, the model was made interpretable using Local Interpretable Model-agnostic Explanations (LIME). The interpretable model was then used by 145 participants to estimate whether certain individuals had survived the Titanic disaster.
It was found that an explanation mechanism significantly and positively influences the accuracy of the participants in estimating survival (B = 0.1037, p = 0.003). Second, an explanatory mechanism also increases the trust of the participant in the model (B = 0.8955, p = 8.8e-08). We found women to be slightly better at predicting survival than men, although this may be an artifact of the methodology. We found no correlation between expertise on the Titanic and accuracy, nor between experience with machine learning models and accuracy.
Managers can use these insights in critical areas where even a small boost in accuracy yields large benefits. Companies that already use machine learning, for recommendation or automated decision making, can use explanatory mechanisms to increase consumer trust in the model. Further research is needed to generalize interpretable machine learning to areas other than decision-making support and classification problems.
Keywords: Machine learning, interpretable machine learning, neural networks, LIME
Preface
Writing a thesis is a rite of passage undergone by all students. For me the journey was thoroughly enjoyable: I was able to immerse myself in the field of machine learning and write about it. Combining the new field of machine learning with old history like the Titanic was especially satisfying. I am happy that I was able to sneak in a little history after all. Throughout the writing I was helped by a few people I would like to thank.
First of all my supervisor Keyvan Dehmamy. While help from one's supervisor can be expected, Keyvan went truly above and beyond my expectations, something I am very grateful for. Second I would like to thank my fellow student Darius Lamochi, who helped me from the start. Lastly I want to thank my friends and family who supported me, especially Daan Romp, who was kind enough to check my thesis for spelling.
I hope you will enjoy reading my thesis.
Rens Sturm
June, 2020, Groningen
Table of contents
Chapter 1: Introduction ......................................................................................................................6
1.1: Relevancy of the problem .........................................................................................................6
1.2: Problem analysis ......................................................................................................................6
1.3: Academic and practical relevance.............................................................................................7
1.4: Originality of the research question ..........................................................................................7
1.5: Outline thesis ...........................................................................................................................8
Chapter 2: Theoretical framework ......................................................................................................9
2.1: Growth in application of machine learning .................................................................9
2.2: Interpretable and non-interpretable models .......................................................................... 10
2.3: Techniques to make uninterpretable models interpretable .................................................... 12
2.4: Application of interpretable black box modelling in decision support........................ 15
2.5: Conclusion ............................................................................................................................. 16
Chapter 3: Research design & Methodology...................................................................................... 19
3.1: Introduction to neural networks ........................................................................................... 19
3.2: Model development & dataset description ............................................................................. 19
3.3: Research method ................................................................................................................... 22
3.4: Data collection ....................................................................................................................... 23
3.5: Plan of analysis ....................................................................... 24
3.6: Conclusion ............................................................................................................................. 24
Chapter 4: Data analysis .................................................................................................................... 26
4.1: Sample analysis. ..................................................................................................................... 26
4.2: Reliability, validity, representativity ........................................................................................ 26
4.3: Hypotheses and statistical tests .............................................................................................. 27
4.4: Interpretation ........................................................................................................................ 30
4.5: Conclusion ............................................................................................................................. 31
Chapter 5: Discussion, limitations and recommendations ................................................................. 32
5.1: Reflective discussion on the results ........................................................................................ 32
5.2: Limitations ............................................................................................................................. 34
5.3: Academic and managerial conclusions.................................................................................... 35
5.4: Conclusion ............................................................................................................................. 35
References ........................................................................................................................................ 36
Chapter 1: Introduction
In this chapter the research problem is introduced and its theoretical and managerial relevance is discussed. We will show that the research problem is urgent, important and original.
1.1: Relevancy of the problem
Machine learning, a class of algorithms that improve with experience (Langley & Simon, 1995), has achieved a wide range of applications. It is used in, among other areas, defense, research, production, air and traffic control, portfolio selection and decision support (Zuo, 2019). The market for machine learning is predicted to grow 186% annually until 2024 (Zion Market Research, 2017). An example of the potential of machine learning decision support is a study in which US police chiefs used machine learning to predict whether officers were at risk of overreacting toward civilians. The final decision to intervene was left to the chief, but with machine learning support accuracy increased by 12% (Carton et al., 2016). A Bank of England survey showed that two-thirds of finance companies in the UK use machine learning as decision support (Bank of England, 2019), making machine learning highly relevant in the 21st century.
The increased use of machine learning has evoked feelings of unease in consumers (Cio Summits, 2019). Reasons include that machine learning models often give no explanation and do not decide perfectly (Guidotti et al., 2018). In response to the growing unease, the European Union has passed legislation requiring that consumers have a "right to explanation" after being subjected to a decision by an automated process (Regulation, 2016). Consequently, if a human agent uses a machine learning model as input for a decision, the model must be explained. Models for which an explanation can be given are called interpretable or white box; models that cannot be explained are called uninterpretable or black box. Thus, due to the growing concern and unease, the need for interpretable machine learning models has become urgent.
1.2: Problem analysis
Uninterpretable models prevent organizational learning and carry the risk of a systematic error in the data used for development, i.e. a training bias (Guidotti et al., 2018). The model may sacrifice organizational goals for the proxy goal it has been given (Doshi-Velez & Kim, 2017), or it may have "cheated" during training by using the wrong features (input data points), such as an image background, or by reproducing a racial bias captured in the historic data (Guidotti et al., 2018).
A further complication is the absence of consensus about the definition of interpretability. For some academics and managers, mechanical knowledge of the model is sufficient; others want to know which input was decisive in the decision-making process (Guidotti et al., 2018). Second, there is an ongoing discussion about how to measure interpretability (Doshi-Velez & Kim, 2017). Some researchers have stipulated that increased human accuracy is an important metric to keep in mind (Doshi-Velez & Kim, 2017), in some cases even the most important one.
1.3: Academic and practical relevance
The challenge of making machine learning interpretable is almost as old as the field of machine
learning itself. In 1983, the first solutions for explaining why a model predicted a certain output were proposed, laying the foundation for later innovations (Swartout, 1983). Since then, research has
mainly focused on explaining machine learning models. Removing parts of the model, changing the input to see how the output changes, or calculating certain values have all been proposed to make these uninterpretable black box models understandable (Guidotti et al., 2018). Absent is research on the effect of interpretation methods on the human decision-maker and his or her accuracy. Given how widely machine learning is used for decision support, the topic of machine learning and human accuracy provides an original and relevant research question.
Establishing a correlation between interpretable machine learning and human accuracy would help
establish increased human accuracy as an important metric for interpretable machine learning and
help settle the discussion on how to measure interpretability.
Due to the EU legislation, companies have to use interpretable machine learning for their
procedures, making it all the more relevant. Since companies use machine learning during decision
support (Bank of England, 2019) it is relevant to know whether interpretability helps in this process.
If accuracy increases it could be used by managers to make better decisions in critical situations,
helping the organization reach its goals.
1.4: Originality of the research question
In this report we will look at the research question: "What is the influence of an explanation of the
output of a model on the accuracy of the human agent?” Answering this question will add to existing
literature in the following way: much research has been done regarding interpretable machine
learning and machine learning accuracy in autonomous tasks, but little research has been done about
human accuracy in non-autonomous tasks like the early warning methods described at the beginning
of this chapter. A second contribution is that while increased accuracy is proposed as a metric to
judge the machine learning model on, it has (to the best of our knowledge) not yet been tested in an
experimental set-up.
1.5: Outline thesis
This thesis builds on earlier research and experiments. In chapter two we will discuss the theoretical
framework of the thesis and draw a number of hypotheses from the existing literature on what seems likely but is not yet proven. In the third chapter we explain how we set up an experiment to test these hypotheses; the experiment ensures that we have accurate, unbiased data to work with. The fourth chapter analyses the data collected in the experiment using various statistical methods, with which we can accept or reject each hypothesis. Lastly, in chapter five, we discuss the findings, the limitations of our research, and how to move on from there.
Chapter 2: Theoretical framework
In this research we analyze the role of machine learning recommendations in behavioral decision making, using the insights of other researchers as a foundation, framework and stepping stone. In this chapter we briefly describe relevant literature about machine learning, interpretable and non-interpretable machine learning, and decision making. The existing literature is used to formulate hypotheses that come together in the conceptual framework at the end of this chapter. These hypotheses will be tested in the following chapters.
2.1: Growth in application of machine learning
To understand the relevance of interpretable machine learning, we first have to discuss the context of machine learning. In this section we look at applications of machine learning methods, examine why they are popular, and distinguish between situations where a human agent is the final decision-maker and situations where the machine learning algorithm is autonomous. This distinction demarcates the theoretical reach of the research.
In the past few years, machine learning has greatly improved in performance. In 1997 IBM’s Deep
Blue defeated world champion Kasparov in chess which is a structured game with rigid rules
(Greenemeier, 2020). Only fourteen years later, IBM’s Watson defeated two champions at Jeopardy,
a far more unstructured game (Cbsnews.com, 2020). Explanations for improved performance are
that computers have become more powerful, engineers use more complex models and they have
more data to work with (Hodjat, 2015; Hao, 2019). The application of machine learning has expanded rapidly, ranging from automation (national defense, research, production automation, air and traffic control) to support (portfolio selection and human decision-making support) (Zuo, 2019). In an automated role, the machine learning model takes over a task from a human planner. In a support role, the model supports the human but does not make decisions for him or her. Machine learning is
also used in a wide range of decision-making support. Healthcare organizations rely on neural
networks as decision support (Shahid, Rappon & Berta, 2019) and investors use neural networks, a
popular method of machine learning, as decision support during portfolio selection and risk
management (Al-Qaheri, Hasan, 2010). As mentioned earlier, two-thirds of finance firms in the UK
use machine learning (Bank of England, 2019). The increase in application is not without reason.
Trained neural networks are better at predicting Research and Development (R&D) costs than humans (Bode, 1998), and a 2019 study showed that neural networks combined with human coaches performed much better than humans alone (Zuo, 2019), providing clear benefits for adopters of machine learning.
Proposed explanations of this superior performance point to limited human cognitive capacity for
processing new information (Shahid, Rappon & Berta, 2019). Machine learning models can handle
large quantities of complicated information while humans find handling many data points difficult.
Humans often work with small samples of personal experience while machine learning models work
with larger samples (Bode, 1998). Humans also have a limited capacity to mentally calculate interactions, whereas neural networks in particular have less of a problem with this (Shahid, Rappon &
Berta, 2019). Machines can work through large numbers of scenarios, while humans have a tendency
to lock in early (Nickerson, 1998). The basic explanation seems to be that the computational power
of computers is greater than human processing power.
Daniel Kahneman, Nobel laureate and professor of human decision making, has famously stated that simple algorithms consistently outperform human decision making because, unlike humans, they are insensitive to noise (Kahneman, Rosenfield, Ghandi & Blaser, 2016). A literature overview from 1979 showed a consistent superiority of algorithmic decision making over human decision making (Dawes, 1979). This confirms the earlier stated theories about machine learning
superiority. However, critics point out that historical data may not be representative. Before the 2008 crisis, house prices rose steadily, leading algorithms to assume they would do so forever. The steady climb turned out to be a bubble: it ended in a financial crash, and house prices fell sharply at an unprecedented rate (Trejos, 2007). Other critics point out that humans are better than machine
learning models in some tasks (Thiel & Masters, 2015). Young children have no difficulty in making a
distinction between dogs and cats, while machines do.
Thus, machine learning is getting better at many tasks humans have previously performed. Automation and decision support are two ways machine learning can add value. Models are better at handling and reasoning over vast quantities of information, while humans outperform computers in other areas. Given these benefits, we predict that humans aided by machine learning will be more correct and precise in their predictions (human accuracy). We therefore expect that when machine learning is used and a human agent can understand its output, human accuracy increases.
2.2: Interpretable and non-interpretable models
While designing a machine learning model, an important decision is whether to make the model
interpretable or not. This decision influences interpretability and trust, and we are interested in whether it also influences human accuracy.
In the introduction we stated that the definition of interpretability is open for discussion (Miller,
2019). In this research we will define the interpretability of the model as the degree to which a
human can understand the cause of a decision (Biran & Cotton, 2017; Miller, 2019). Guidotti et al.
(2018) defined two components of uninterpretable models: 1) Opaqueness and 2) number of parts.
A model is opaque when internal workings cannot be observed. Neural networks are mostly
uninterpretable because internal workings are unobservable, and because they contain many parts.
Models that provide no explanation (also called black boxes) lead users to see the model as less
trustworthy (Ribeiro et al, 2016). Providing an explanation increases the acceptance of movie
recommendations (Herlocker, Konstan & Riedl, 2000). In a 2003 study, participants rated a recommendation model as much less trustworthy when it gave no explanation, and used it significantly less (Dzindolet et al., 2003). Users report feelings of violated privacy when black box models make recommendations (Van Doorn & Hoekstra, 2013), and a feeling of unfairness (Zhang & Bareinboim, 2018).
Second, a black box model might not work as well as one thinks. A model may 'cheat' by focusing on a spurious feature. For example, a complicated neural network could accurately predict whether a tank belonged to the United States Army by recognizing whether the photo contained clouds (Guidotti et al., 2018): friendly tanks were photographed in good weather, enemy tanks in bad weather. Another model recognized whether an animal was a wolf or a husky by spotting snow in the picture (Guidotti et al., 2018). The reason for this 'cheating' is simple: the model does not know these features are 'off limits'. If the model could give an explanation, these kinds of mistakes could be spotted and fixed before they cause damage. Statistics on these kinds of errors are missing, but the risk is real.
Lastly, black box models trained on historic data can take over undesirable patterns. A model trained in 2016 to predict the risk of crime recidivism showed a large racial bias against people of color (Guidotti et al., 2018). Also in 2016, Amazon's decision not to offer same-day delivery in minority neighborhoods was largely influenced by a predictive machine learning model (Letzter, 2020). These are outcomes companies want to avoid, not least because the Universal Declaration of Human Rights specifies that treatment should not depend on race ("Universal Declaration of Human Rights", 2020).
While uninterpretable models carry disadvantages, so do interpretable models. A major disadvantage of interpretable machine learning models is that they are often simpler, and therefore less accurate (Fong & Vedaldi, 2017). Second, interpretable models are easy to replicate, making them unsuitable as a sustainable competitive advantage (Guidotti et al., 2018).
While few openly oppose explainable models, they have seen little use in practice (Ribeiro et al., 2016). Many agree that explainability is good to have, but few are willing to give up accuracy and a competitive advantage to obtain it. The 2016 EU legislation has made it mandatory to provide an explanation for automated decisions. What constitutes an explanation, however, is open to interpretation. This ruling has made the need for accurate explainable models, and for a consensus on what an explanation is and how it can be measured, all the more urgent.
In conclusion, uninterpretable models carry some major disadvantages but are still preferred in practice due to their higher accuracy. The EU ruling has made it, at least in the European Union and its member states, necessary by law to use explainable models in decision-making scenarios. Due to their opaqueness and multitude of parts, we expect that the more complex a machine learning model is, the less interpretable a human agent will find the model.
2.3: Techniques to make uninterpretable models interpretable
The previous sections established the need for accurate interpretable models. Popular solutions explain the model to the human instead of simplifying it, so that accuracy and interpretability can both be high. We will look at popular methods and explore the influence of explanation mechanisms on trust and understanding. This relation will later be used in the conceptual framework.
In the previous sections the distinction between black box and white box models was made. There is a third option: explainable black box models. These models are opaque and contain many parts, but provide, as an additional step, an explanation of what they did to enhance interpretability (Guidotti et al., 2016). This kind of explanation does not come at the expense of accuracy.
Black box models can be explained in many ways. We will look at three popular methods: ablation,
simplification and saliency.
A basic method of understanding black box models is ablation, where parts of the model are removed to see how the accuracy changes (Xu, Crammer & Schuurmans, 2016). This reveals which parts of the machine learning model matter most. A major downside of ablation is that it is computationally expensive and does not scale well (Sundararajan, 2017).
Simplification methods try to recreate a simpler version of the original model (Yang et al., 2018). A well-known technique is pruning, where ineffective parts of the model are cut away (Guidotti et al., 2016). By decreasing the number of parts in a model, interpretability goes up, since adding more parts decreases interpretability (see the previous section).
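As a toy illustration of pruning (our own sketch, not the thesis implementation), magnitude pruning zeroes the smallest weights of a trained network, leaving fewer active connections to interpret. The random "trained" weight matrix and the 50% fraction are assumptions.

```python
# Illustrative magnitude pruning: zero out the smallest-magnitude weights
# of one layer so fewer parts remain. The matrix below is a stand-in for
# a trained layer; real pruning would also fine-tune afterwards.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 4))  # stands in for one layer of a trained network

def prune(weights, fraction=0.5):
    """Return a copy with the `fraction` smallest-magnitude weights set to 0."""
    threshold = np.quantile(np.abs(weights), fraction)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

W_pruned = prune(W)  # half the connections are removed, simplifying the model
```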
Lastly, saliency methods increase and decrease parts of the input to see when the model reaches certain tipping points (Sundararajan, 2017; Guidotti et al., 2016). A popular method is Integrated Gradients, which scales each pixel from 0% strength to 100% and accumulates how the output changes along the way. By highlighting only the pixels that are necessary for the model to recognize the picture, it becomes clearer to a human why the model classifies the picture as it does. An example is shown in figures 1 and 2 below.
Figures 1 and 2: a picture of a turtle analyzed with Integrated Gradients (Sundararajan, 2017).
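The path-scaling idea behind Integrated Gradients can be sketched for any differentiable model. The quadratic toy model and the zero baseline below are our own illustrative assumptions; for this choice of f, the attributions provably sum to f(x) − f(baseline), the method's "completeness" property.

```python
# Integrated Gradients sketch (Sundararajan et al., 2017): average the
# gradient along a straight path from a baseline to the input, then scale
# by the input difference. The toy model f is an illustrative assumption.
import numpy as np

def f(x):
    return float((x ** 2).sum())   # toy differentiable "model"

def grad_f(x):
    return 2 * x                   # its analytic gradient

def integrated_gradients(x, baseline=None, steps=100):
    if baseline is None:
        baseline = np.zeros_like(x)
    # Midpoint rule over interpolation coefficients alpha in (0, 1)
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean(
        [grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad   # per-feature attribution

x = np.array([1.0, -2.0, 0.5])
attr = integrated_gradients(x)
# Completeness check: attr.sum() equals f(x) - f(baseline) = 5.25
```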
A second popular saliency method, which we will use in our research, is Local Interpretable Model-agnostic Explanations (LIME) (Guidotti et al., 2016). The LIME technique changes the input of the model marginally and observes how the output differs, thus deducing the influence of the input variables. An advantage of the LIME technique is that it can be used with many machine learning methods (it is "model-agnostic") and is therefore flexible in use.
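The mechanism can be hand-rolled in a few lines: perturb the instance, query the black box on the perturbations, and fit a locally weighted linear model whose coefficients serve as the explanation. The sigmoid stand-in for the black box, the kernel width and the sample count are all illustrative assumptions; in practice one would use the `lime` package itself.

```python
# Hand-rolled LIME-style sketch. The sigmoid "black box", kernel width and
# sample size are illustrative assumptions, not the thesis implementation.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box(X):
    """Stands in for an opaque model, e.g. a trained neural network."""
    return 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 3.0 * X[:, 1])))

x = np.array([0.5, 0.2, 0.9])                    # instance to explain
Z = x + rng.normal(scale=0.3, size=(1000, 3))    # marginally changed inputs
y = black_box(Z)                                 # black-box outputs
w = np.exp(-np.sum((Z - x) ** 2, axis=1) / 0.25) # proximity kernel
local = Ridge(alpha=1.0).fit(Z, y, sample_weight=w)
# local.coef_ approximates each feature's local influence on the output:
# positive for feature 0, negative for feature 1, near zero for feature 2.
```

Because only inputs and outputs are queried, this works for any model, which is what makes the approach model-agnostic.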
Saliency methods do not explain the inner workings of the machine learning model (Guidotti et al., 2016); they merely explain the relationship between the input of the model and its output (Ribeiro, Singh & Guestrin, 2016). This means that some aspects remain unobservable, and the problems those aspects cause will remain.
As described earlier, withholding an explanation when one is expected invokes feelings of mistrust, violated privacy and unfairness (Van Doorn & Hoekstra, 2013; Zhang & Bareinboim, 2018). Second, it has been observed in the social sciences that giving an explanation increases compliance and trust even if the explanation is not logically sound (Cialdini, 2014). A multitude of studies find that giving an explanation increases trust and understanding (Kim et al., 2016; Gkatzia et al., 2016; Biran and
McKeown, 2017). Showing which features were most important in making a prediction can help a human understand why the decision was made, thus increasing interpretability. However, it should be noted that a 2019 study found that while a good explanation does indeed increase trust and understanding, a bad explanation decreases trust (Papenmeier, Englebienne & Seifert, 2019). The sample size of that study (N = 327) was large enough to be reliable; had the sample been small, the contradictory finding could have been dismissed as a false negative. To the best of our knowledge, however, the finding has not yet been replicated by the authors or by third parties, so a false negative remains possible. If the finding does replicate, a difference in study setup or an underlying nuance is the likely explanation.
Why is an interpretable model more trustworthy than an uninterpretable one? A basic explanation can be found in evolutionary psychology: the unknown carries risk, and we are risk-averse (Zhang, Brennan & Lo, 2014). Digging deeper, we find additional theories: providing an explanation gives the participant the option to assess the fitness of the machine for the particular case. In a 2016 study, researchers found that participants rate models as more trustworthy, even when the models make mistakes, if the human has the possibility to change the model (Ribeiro, Singh & Guestrin, 2016). It should be noted, however, that the sample size of that particular study was rather small (N = 100). Other studies do find that explanations increase trustworthiness if the human has the possibility to deviate from the final recommendation (Kim et al., 2016; Gkatzia et al., 2016; Biran & McKeown, 2017). A second explanation is that an explanation gives the human actor the possibility to examine the internal logic. It is well established that computer programs do not follow 'common sense' and make 'silly mistakes' (Ribeiro, Singh & Guestrin, 2016).
Providing an explanation thus seems to increase interpretability and trustworthiness, even though there are some contrary findings. When an explanation is given, it becomes clearer why the model gave a certain prediction, even though neither the internal workings of the machine nor the interactions between its parts are explained. Hence we draw the following hypotheses:
H1: When a machine learning model contains an explanatory mechanism, the human agent increases
in accuracy.
H2: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more trustworthy.
H3: When a machine learning model contains an explanatory mechanism, the human agent sees the
machine learning model as more interpretable.
2.4: Application of interpretable black box modelling in decision support
Next we will discuss the potential of interpretable black box modelling. We will describe the
expected relation between interpretable models and human accuracy.
As we have seen, machine learning can often predict better than humans. Judges often have to decide whether a suspect can await trial at home (often after posting bail) or must stay in jail because he or she is a flight risk or expected to commit crimes while at home. A 2017 study developed an algorithm that could make this decision better, reducing jail rates by 42% without increasing crime; human judges were overwhelmed by 'noise', i.e. unrelated information (Kleinberg et al., 2017). The difference raises the question: if a machine learning model judges criminals better than human judges do, shouldn't we leave the decision up to the machines? Opponents stipulate that this would be unwise and even unethical (Bostrom & Eliezer, 2011): algorithms do not understand values like fairness and liberty, nor should we subject ourselves to computers we do not understand. Even the researchers themselves were hesitant to replace all human judges with computers, since their study had not been replicated and could contain errors (Kleinberg et al., 2017). Still, they were optimistic.
Kleinberg et al., however, do not specify their recommendation for scenarios where a model (a) does not have sufficient data to work with, (b) does not understand important variables (like certain values), or (c) where humans are better at part of the job. Hence there are limits to when machine learning is better than a human. In situations where neither the human nor the machine learning model is superior in all the features used for prediction, there is an opportunity for synergy.
A third strand of the argument proposes a middle way: do not replace humans with algorithms but
augment them (Rubin, 2019). This way the human and the machine each do what they are best at.
In 2017, Doshi-Velez and Kim proposed that we should judge decision-aiding models on whether they
help the decision-maker, since this is closest to the goal of the model. This middle way would allow
the human agent to control for variables the model does not understand or does not work well with.
It would also simplify who is responsible if the decision does not pan out. These benefits support our
first hypothesis, that interpretable machine learning increases the accuracy of the human agent.
The characteristics of the human agent influence which of these explanations, if any, will be found
to be true. Biological aspects like age and gender matter during the decision-making process, and
previous experience or early exposure to certain variables can influence decision making as well. We
will look at these two variables in greater detail.
Age has been found to be relevant to decision making in multiple studies. Taylor (1975) found that
the age of a manager significantly and positively influences the accuracy of the decision-making
process. While the claim has been made that age negatively affects the ability to deal with new
technologies, such as machine learning, that study found no empirical evidence to back the claim up
(Taylor, 1975). Other studies found elderly participants to be more risk-averse than younger
participants, an effect that was consistent both in cases where risk aversion was advisable and in
cases where it was not in their interest (Cauffman et al., 2010). Older people have, however, been
found to be more rigid in their thinking due to a decline in fluid cognitive ability, i.e. the ability to
reason and solve problems without relying on previous experience (de Bruin et al., 2010). Based on
these findings, if one classifies machine learning as a new technology, it is likely that older
participants find it harder to work with. This conclusion has yet to be tested.
Hence we draw the following hypothesis: H4: the age of the human agent influences the causal link
between machine learning support and human agent accuracy negatively.
The influence of gender on decision-making, especially in managerial positions, has long been a
controversial topic. While women are, on average, underrepresented in politics ("Women in
Parliaments", 2020) and in business executive positions ("Female Business Leaders", 2020), there
does not seem to be a convincing biological factor affecting decision making. Women do seem to be
more risk-averse than men, regardless of the level of ambiguity (Powell & Ansic, 1997). Second,
women score higher on the personality trait of agreeableness (Chapman et al., 2007), which may
mean that women are more likely to follow the advice of the machine learning model.
Hence we draw the following hypothesis: H5: the gender of the human agent influences the causal
link between machine learning support and human agent accuracy.
Age and experience are correlated, but not the same thing: a sixty-year-old manager using a machine
learning model for the first time is aged, but not experienced (in machine learning models). While it
seems common knowledge that extended experience causes superior performance, this may not be
the case. A study among physicians found no correlation between age and performance (Ericsson,
2006), and the same researcher found cases where experience led to worse performance due to
overconfidence (Ericsson, 2004). Other studies likewise found no link between experience and
performance (Ericsson & Lehmann, 1996). Thus we expect to find no link between experience and
decision accuracy.
2.5: Conclusion
From these hypotheses we draw the following conceptual model: complexity (caused by opaqueness
and the number of parts) causes the model to be less interpretable, and interpretability positively
influences accuracy. For the human agent to work well with the machine learning model, he or she
must trust the model to be accurate and to capture the whole question; the level of trust positively
influences the accuracy of the human agent. Adding an explanatory mechanism influences two
relationships in the conceptual framework. First, it dampens the impact of the model's complexity on
its interpretability, making that relationship less negative. An interpretable model makes the human
agent more accurate because he or she can fill in the gaps the algorithm does not understand, and
the explanatory mechanism moderates this effect. Second, the explanatory mechanism increases
trust: by understanding the model better, the human agent can judge the model more accurately,
which increases trust. For an overview, see figure 3.
Figure 3: conceptual framework
The conceptual framework crystallizes five hypotheses:
H1: When a machine learning model contains an explanatory mechanism, the human agent increases
in accuracy.
H2: When a machine learning model contains an explanatory mechanism, the human agent sees the
machine learning model as more trustworthy.
H3: When a machine learning model contains an explanatory mechanism, the human agent sees the
machine learning model as more interpretable.
H4: the age of the human agent influences the causal link between machine learning support and
human agent accuracy negatively.
H5: the gender of the human agent influences the causal link between machine learning support and
human agent accuracy.
In the next chapter, "Research design and Methodology", an experiment is set up to test each of
these hypotheses. Chapter four uses this setup to analyze the collected data and conclude whether
the hypotheses can be accepted or rejected.
Chapter 3: Research design & Methodology
To research whether interpretability influences accuracy we set up an experiment, since (to the best
of our knowledge) this has not been researched before. The experiment takes place in three steps:
1. Developing a neural network that predicts the survival of the passengers
2. Administering a survey in which the availability of an explanatory mechanism is manipulated
3. Analyzing the data
Before elaborating on the design of the experiment, we give a brief summary of the nature of a
neural network, the type of machine learning algorithm used for the experiment. This was not
covered in the literature overview since it is not part of the conceptual framework, but a basic
understanding helps in understanding the research design. Readers already familiar with neural
networks can skip this part.
3.1: Introduction to neural networks
A neural network is a type of machine learning model. Its task is to translate input/independent
variables (IVs) into an output/dependent variable (DV) (Goodfellow et al., 2017). The neural network
does this by sending the input to a hidden layer (see figure 4) that weighs each input up or down by
a certain amount. The new value is then forwarded to another hidden layer, or to an output layer.
At first the model does not know by how much to weigh the values up or down; this is learned during
the training stage. The designers give the neural network a long list of IVs together with the
corresponding DVs, and each time the model gets a question wrong it adjusts its weights a little,
a process called back-propagation (Goodfellow et al., 2017). This is repeated until the model cannot
improve any further. Once trained adequately, the model is tested on a new list of questions for
which the answers are not provided.
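To make the forward pass and a single back-propagation step concrete, here is a minimal sketch for a toy network with one hidden layer. All sizes and numbers are illustrative and unrelated to the model used in this study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network: 4 inputs -> 3 hidden nodes -> 1 output node.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(3, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    # Each layer weighs its inputs and applies a non-linear activation
    # before forwarding the result to the next layer.
    h = np.maximum(0.0, x @ W1)   # hidden layer (ReLU activation)
    return sigmoid(h @ W2), h     # output layer yields a probability

x = rng.normal(size=(1, 4))       # one fictitious observation
y = np.array([[1.0]])             # its true label

p, h = forward(x)

# One back-propagation step: nudge both weight matrices against the
# gradient of the binary cross-entropy loss.
lr = 0.1
grad_out = p - y                       # error signal at the output node
grad_h = (grad_out @ W2.T) * (h > 0)   # chain rule back through the ReLU
W2 -= lr * h.T @ grad_out
W1 -= lr * x.T @ grad_h
```

Repeating this update over many labelled examples is what the training stage described above amounts to.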
3.2: Model development & dataset description
First, we developed a neural network (NN) that predicts whether a passenger of the Titanic survived
by analyzing their cabin class, gender and the presence of family. This NN was trained on a Kaggle
dataset. After development, we made the model interpretable by adding a LIME-explanator.
In order to develop a good model we need good data. The dataset used is provided by Kaggle, an
online learning community for data science and machine learning (TechCrunch, 2020). The Titanic
dataset comes from historical records; it has been used by scholars, for example in Chatterjee (2018),
and has been verified by a non-profit Titanic remembrance group (Titanic Survivors, 2020). The
training dataset contains 891 observations of 12 variables.
Due to the completeness of the dataset, no observations were removed. To predict survival, a few
variables were removed: the ticket code, name, cabin name, age, fare and embarking location
showed little correlation with survival or contained missing values. Only the independent variables
class, sex (gender), the presence of a sibling or spouse, and the presence of a parent or child were
retained, along with the dependent variable, whether the individual survived. The presence of
siblings, spouses, parents or children was transformed into a binary variable (present or not present)
in order to keep the model simple for the participants as well. See table one for the data types used
after the transformation of the data.
| Variable | Explanation | Data type |
|---|---|---|
| Class | The class of the cabin in which the passenger stayed during his/her voyage on the Titanic | Factor with 3 levels: First class, Second class, Third class |
| Gender | The gender of the passenger | Factor with 2 levels: Male, Female |
| SibSp | The presence of a sibling or spouse on the ship at the moment of sinking | Factor with 2 levels: present or not present |
| ParCh | The presence of a parent or child on the ship | Factor with 2 levels: present or not present |
| Survived | Whether the passenger survived; the dependent variable of the machine learning model | Factor with 2 levels: did survive or did not survive |

Table 1: variables of the neural network and their data types
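The variable selection and binary transformation described above could be sketched as follows, assuming the column names of the public Kaggle file (the rows shown here are made up for illustration):

```python
import pandas as pd

# Illustrative rows in the shape of the Kaggle Titanic training file
# (column names assumed from the public dataset, values invented).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 2],
    "Fare": [7.25, 71.28, 7.92],
    "Survived": [0, 1, 1],
})

# Keep only the variables retained for the model, as described above.
df = df[["Pclass", "Sex", "SibSp", "Parch", "Survived"]]

# Collapse the family counts into binary present / not-present factors.
df["SibSp"] = (df["SibSp"] > 0).astype(int)
df["Parch"] = (df["Parch"] > 0).astype(int)
```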
The neural network itself was kept simple, with five layers: an input layer, three hidden layers of 300
nodes each, and an output layer with a single node (survival). The model was kept simple because it
already produced adequate results. It was trained in 15 rounds; after each round (also called an
epoch), the model used back-propagation to adjust its weights for more accuracy. After training, the
model showed an in-sample accuracy of 82.24% and an out-of-sample accuracy of 81.25%. This was
rounded down to 80% when the model was presented in the survey.
The input layer contained one node for each of the four input variables. These values were
forwarded to the three hidden layers.
The hidden layers used 300 nodes each, together with a 'uniform' weight initialization and the 'relu'
activation function. The weight initialization helps the model converge faster and therefore reach
optimal accuracy sooner. The 'uniform' initialization draws all starting weights from the same
(uniform) distribution (Thimm & Fiesler, 1997). Since this is the most neutral choice, it is hard to get a
badly initialized model with it; seeing that the accuracy of the neural network was fairly high, the
initializer was judged to work satisfactorily. As activation function for the hidden layers the 'relu'
function was chosen. The 'relu' function works very simply: if a node receives a value of less than
zero, it passes forward a value of zero, otherwise it passes the value on unchanged. It is known to be
simple and not too taxing for the computer system running it (Alrefaei & Andradóttir, 2005) and is
widely used in other scientific papers (Jiang et al., 2018).
Lastly, the output layer transformed the values of the hidden layers into a single value: the
probability of that passenger surviving. The output layer again used a 'uniform' initialization,
together with a 'sigmoid' activation function, which translates the value into a probability that can
be used by the human agent. The model was compiled using the 'Adam' optimizer, a gradient-
descent method that adapts the learning rate for each weight during training (Kingma & Ba, 2014).
The loss function was binary cross-entropy.
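Under these choices, the architecture could be sketched in Keras roughly as follows (layer sizes and settings are taken from the description above; the exact original code is in appendix one):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the described architecture: 4 inputs, three hidden layers of
# 300 nodes with uniform weight initialization and ReLU activation, and
# a single sigmoid output node yielding a survival probability.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(300, activation="relu", kernel_initializer="random_uniform"),
    layers.Dense(300, activation="relu", kernel_initializer="random_uniform"),
    layers.Dense(300, activation="relu", kernel_initializer="random_uniform"),
    layers.Dense(1, activation="sigmoid", kernel_initializer="random_uniform"),
])

# Adam optimizer and binary cross-entropy loss, as in the text.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Training for 15 epochs would then look like:
# model.fit(X_train, y_train, epochs=15, validation_split=0.2)
```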
Finally, the LIME-explanatory mechanism was added, using the variables class, gender, the presence
of a sibling or spouse, and the presence of a parent or child to explain why the neural network makes
a certain prediction. To make the labels more interpretable, they were later manually changed to a
more readable version. The whole process was executed in the digital environment of Google
Colaboratory. To code the LIME-explanatory mechanism we consulted a guide on Medium (Dataman,
2020). To code the neural network we drew heavily on the DataCamp courses "Introduction to deep
learning with Keras" and "Advanced deep learning with Keras"; DataCamp is a digital learning
environment paid for by the University of Groningen. For the full code used to program the model,
please consult appendix one.
3.3: Research method
To test the hypotheses we conducted a survey. First, participants received a general introduction
together with some base statistics about the survival rate on the Titanic. The overall accuracy of the
machine learning model was also presented, so that participants could make a fully informed
decision.
Second, the participants were randomized into two groups: the control group received only a
machine learning recommendation, while the treatment group received a machine learning
recommendation with a LIME-explanatory mechanism, plus an additional paragraph explaining the
LIME mechanism and how to interpret it. Each group saw eight historical passengers together with
their features and the neural network's prediction of whether they survived the Titanic disaster. The
participants then had to decide whether they thought these passengers had survived. These
questions could not be skipped and there was no 'don't know' option, to make sure we also collected
data on fringe cases with no clear answer.
For an example of a question for the control group, see figure 5.
Figure 5: example question control group
For an example of a question of the treatment group, see figure 6.
Figure 6: example question treatment group
Both groups received the same passengers with the same attributes. We made sure that the
questions were balanced and uncorrelated with each other, using the guidelines set out for a
conjoint analysis: the attributes of the passengers, like class and children on board, were balanced so
that no spurious correlations could arise. The treatment group received, alongside the neural
network prediction, an interpretation of the model from the LIME-explanatory mechanism. After
each prediction, participants of both groups were asked to judge the interpretability of the model.
The survey closed with some general demographic questions that participants were allowed to skip.
These questions contained an attention check, testing whether participants were actually reading
the questions or mindlessly filling in the blanks; those failing the attention check were excluded from
the results. No reward was given or offered for completing the survey. For the full survey, see
appendix three.
3.4: Data collection
During the survey, information was gathered on five variables which correspond with the conceptual
model. Table two describes these five, together with how they were asked and how they were
measured.
| Variable | Survey question | Measurement |
|---|---|---|
| Human accuracy | "Has the passenger survived the Titanic?" | Percentage correct |
| Interpretability | "Do you understand why the neural network has given this prediction?" | Likert scale |
| Trust | "How much would you trust this model to make decisions in real life?" | Likert scale |
| Explanatory mechanism | Whether there is an explanation (1) or not (0) | NA |
| Expertise | "Have you read or watched any non-fiction books/movies about the Titanic, except for the movie Titanic made in 1997, starring Leonardo DiCaprio?" | Yes/no |

Table 2: variables of the conceptual framework and their measurement method
For the results to be reliable, the sample size needs to be large enough to compensate for outliers
(Fischer & Julsing, 2019). We aimed to collect 150 responses to our survey; however, we managed to
collect only 122 usable responses, which limited our analysis somewhat. Second, the sample needs
to be representative, with people coming from all layers of the population. Age, gender and
education were recorded in order to monitor this. Some imbalance between groups is expected,
since even random samples show patterns.
Participants were recruited from the personal network of the author and his acquaintances. A
secondary source of participants was Facebook and Reddit groups for survey exchange. There was no
monetary incentive for participating in the research.
3.5: Plan of analysis
After executing the experiment we analyze the data. For the first three hypotheses we compare the
treatment sample to the control sample; for this, and for the influence of participant characteristics,
we use multiple regression. In order to calculate the accuracy of the participants, we first compute a
test score by comparing the given answers with the correct answers. For an overview, see table
three.
| Hypothesis | Statistical method |
|---|---|
| H1: When a machine learning model contains an explanatory mechanism, the human agent increases in accuracy. | Multiple regression |
| H2: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more trustworthy. | Multiple regression |
| H3: When a machine learning model contains an explanatory mechanism, the human agent sees the machine learning model as more interpretable. | Multiple regression |
| H4: the age of the human agent influences the causal link between machine learning support and human agent accuracy negatively. | Multiple regression |
| H5: the gender of the human agent influences the causal link between machine learning support and human agent accuracy. | Multiple regression |

Table 3: hypotheses and statistical methods
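The test-score computation can be sketched as follows; both answer vectors are hypothetical:

```python
# Each answer is compared with the historical outcome; the participant's
# accuracy is the share of correct answers (TRUE = 1, FALSE = 0).
correct_answers = [1, 0, 1, 1, 0, 0, 1, 0]   # true survival of the 8 passengers
given_answers   = [1, 0, 0, 1, 0, 1, 1, 0]   # one hypothetical participant

matches = [g == c for g, c in zip(given_answers, correct_answers)]
score = sum(matches) / len(matches)          # -> 0.75 for this participant
```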
3.6: Conclusion
In this chapter we set up an experiment to collect data on the five variables described above.
Furthermore, a description of the model parameters was provided, together with a description of the
process by which the model was developed. With the experiment we will collect the data needed to
test the hypotheses described in the chapter "Literature overview". In the next chapter we analyze
the collected data in order to conclude whether the hypotheses can be accepted or rejected.
Chapter 4: Data analysis
In this chapter we analyze the data collected from the experiment described in the previous chapter
to confirm or reject the hypotheses stated before, using various statistical methods. We explain how
the analysis was executed for maximum transparency, and end with an interpretation of the results.
For the full code used for the analysis, please consult appendix two.
4.1: Sample analysis
Data collection started on the 5th of May and ended on the 15th of May, yielding 145 responses. The
data was exported to a comma-separated values (CSV) file. After rejecting participants who failed the
attention check (including those who quit halfway), 122 respondents remained.
To calculate accuracy, each estimate was checked against the correct answer and marked TRUE
(correct) or FALSE (incorrect). The mean of these answers, with TRUE = 1 and FALSE = 0, gave each
participant a score. The highest score was 87.5%, the lowest 12.5%. Of the participants, 55 (45%)
were female and 67 (55%) were male. The average age was 29 years. Most (107 participants) were
Dutch, with 7 Germans, 3 Belgians and 5 participants of other nationalities. 59% (73 participants)
stated they had never read or watched anything about the Titanic, excluding the popular 1997
movie; 41% (49 participants) had.
There were some technical problems. Due to an editing error, a question for the control group
(which did not receive a LIME-explanator) referred to the LIME model, which was not there; one
participant complained about this. A second participant found the question vague, asking: "Do you
mean whether I understand the model or whether I agree with it?" No other participants
complained, so the confusion may have been limited. Lastly, we collected fewer responses than
expected.
4.2: Reliability, validity, representativity
For a good analysis, the sample of the population that did the experiment needs to be
representative, valid and reliable (Fischer & Julsing, 2019). As we saw before, the gender division in
the sample is roughly equal; there are a few more men than women, but not enough to significantly
skew the sample.
As figure seven shows, the sample is significantly skewed towards younger participants: the bulk is
between twenty and forty years old, with only a few older than forty. There are also few participants
younger than fifteen, but since they are unlikely to use machine learning models for decision
support, this does not distort the sample. The age skew, however, does limit the generalizability of
the findings. Nationality shows another distortion: the vast majority of participants is Dutch, which
means not all nationalities are well represented. This is a limitation of the study.
Figure 7: age distribution of the sample
Because the real answers were not shown to the participants, no maturation or learning effect could
take place, supporting internal validity (Nair, 2009). The chance that the findings are an outlier and
that more testing would lead to less extreme scores (regression to the mean) remains a possibility,
though an unlikely one given the high significance described in the next paragraph. Participants were
assigned at random, ensuring no systematic group bias occurred. Lastly, the drop-out rate is
worrying: it cannot be established whether the twenty-three participants who dropped out did so
because of a systematic flaw or at random. This may prove a threat to the validity of the research
and needs to be studied in any replication study.
4.3: Hypotheses and statistical tests
The first hypothesis states that participants with a LIME-explanator are more accurate (when a
machine learning model contains an explanatory mechanism, the human agent increases in accuracy).
The average accuracy for the treatment group with LIME (M = 0.640, SD = 0.115) is higher than for
the control group (M = 0.566, SD = 0.172). To control for the effect of participant characteristics (see
hypotheses four and five), we performed a multiple regression and found that the LIME-explanatory
mechanism had a significant positive effect (B = 0.1037, p = 0.0003). We can therefore conclude that
having an explanatory mechanism increases accuracy significantly. For the full results, please consult
table 4.
| Variable | Estimate | Std. error | t-value | P-value |
|---|---|---|---|---|
| (Intercept) | 6.215e-01 | 4.109e-02 | 15.125 | < 2e-16 *** |
| LIME | 1.037e-01 | 2.793e-02 | 3.712 | 0.0003 *** |
| Age | -7.766e-05 | 1.104e-03 | -0.070 | 0.9440 |
| Gender | -6.903e-02 | 2.832e-02 | -2.438 | 0.0167 * |
| Knowledge on the Titanic | -7.027e-02 | 3.505e-02 | -2.005 | 0.0473 * |
| Experience with machine learning | -2.572e-03 | 3.940e-02 | -0.065 | 0.9481 |

Table 4: the influence of the LIME-explanatory mechanism on accuracy, controlling for participant
characteristics
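A regression of this shape can be reproduced with, for example, statsmodels. The data below is randomly generated and the column names are illustrative, so the coefficients will not match table 4:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Fake respondent data in the shape of the survey export
# (variable names are illustrative, not the actual column names).
n = 122
data = pd.DataFrame({
    "accuracy": rng.uniform(0.125, 0.875, n),
    "lime": rng.integers(0, 2, n),
    "age": rng.integers(18, 65, n),
    "gender": rng.integers(0, 2, n),
    "titanic_knowledge": rng.integers(0, 2, n),
    "ml_experience": rng.integers(0, 2, n),
})

# Multiple regression as in table 4: accuracy on the LIME dummy,
# controlling for participant characteristics.
model = smf.ols(
    "accuracy ~ lime + age + gender + titanic_knowledge + ml_experience",
    data=data,
).fit()
coefs = model.params   # coefs["lime"] would be the LIME effect estimate
```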
The second hypothesis states that participants with a LIME-explanator trust the model more (when a
machine learning model contains an explanatory mechanism, the human agent sees the machine
learning model as more trustworthy). Again we performed a multiple regression to control for
participant characteristics. We found that the treatment group (M = 3.81, SD = 0.661) sees the
machine learning model as significantly more trustworthy (B = 0.8955, p = 8.8e-08) than the control
group (M = 2.78, SD = 0.917). Hence we can conclude that having a LIME-explanatory mechanism
increases trust significantly. There is no relation between the age of the participant and the trust
placed in the machine learning model, as a Pearson correlation showed (r(120) = -0.09, p = 0.352).
| Variable | Estimate | Standard error | t-value | P-value |
|---|---|---|---|---|
| Intercept | 2.5885 | 0.2307 | 11.222 | < 2e-16 |
| LIME | 0.8955 | 0.1568 | 5.711 | 8.8e-08 |
| Age | 0.0021 | 0.0062 | 0.344 | 0.7317 |
| Gender | 0.2834 | 0.1589 | 1.783 | 0.0772 |
| Knowledge on the Titanic | -0.2883 | 0.1967 | -1.466 | 0.1455 |
| Experience with machine learning | 0.4098 | 0.2212 | 1.853 | 0.0664 |

Table 5: the influence of the LIME-explanatory mechanism on trust, controlling for participant
characteristics
The third hypothesis states that participants with a LIME-explanator see the model as more
interpretable (when a machine learning model contains an explanatory mechanism, the human agent
sees the machine learning model as more interpretable). After performing a multiple regression we
find that the treatment group (M = 3.63, SD = 0.484) sees the machine learning model as significantly
more interpretable (B = 0.3966, p = 0.0003) than the control group (M = 3.25, SD = 0.568). Thus we
can conclude that having a LIME-explanatory mechanism significantly increases the interpretability
of the machine learning model.
| Variable | Estimate | Standard error | t-value | P-value |
|---|---|---|---|---|
| Intercept | 3.1764 | 0.1563 | 20.305 | < 2e-16 |
| LIME | 0.3966 | 0.1062 | 3.733 | 0.0003 |
| Age | 0.0033 | 0.0042 | 0.785 | 0.4343 |
| Gender | -0.0014 | 0.1077 | -0.013 | 0.9895 |
| Knowledge on the Titanic | -0.1866 | 0.1333 | -1.400 | 0.1642 |
| Experience with machine learning | 0.1250 | 0.1499 | 0.834 | 0.4059 |

Table 6: the influence of the LIME-explanatory mechanism on interpretability, controlling for
participant characteristics
Lastly, we looked at the influence of the participants' characteristics (the age of the human agent
influences the causal link between machine learning support and human agent accuracy negatively;
the gender of the human agent influences the causal link between machine learning support and
human agent accuracy). We used a multiple linear regression to predict the accuracy of participants
from their age, gender, whether they had any expertise on the Titanic, and whether they had ever
worked with machine learning methods before. A significant link between these variables was found,
F(4, 117) = 2.882, p = 0.028, with an R2 of 0.09; after controlling for the number of variables in the
model, the adjusted R2 is 0.0568. We find that men are slightly worse at predicting than women
(β = -0.06, p = 0.033). Having read or watched non-fiction about the Titanic decreased accuracy
(β = -0.08, p = 0.042). Neither age (β = -0.0008, p = 0.475) nor experience with machine learning
(β = 0.04, p = 0.345) influenced accuracy significantly (see table 7).
| Variable | Estimate | Standard error | t-value | P-value |
|---|---|---|---|---|
| Intercept | 0.6768 | 0.0403 | 16.781 | < 2e-16 |
| Age | -0.0008 | 0.0011 | -0.717 | 0.4748 |
| Gender | -0.0643 | 0.0298 | -2.159 | 0.0329 |
| Knowledge on the Titanic | -0.0759 | 0.0369 | -2.059 | 0.0417 |
| Experience with machine learning | 0.0378 | 0.0399 | 0.949 | 0.3448 |

Table 7: estimation of participant characteristics on the level of accuracy
With an ANOVA we tested whether participants rate their own gender as more likely to survive than
passengers of the other gender. While female participants rated it more likely that female
passengers survived (F(1, 120) = 5.882, p = 0.017), male participants did not rate it more likely that
male passengers survived (F(1, 120) = 1.019, p = 0.315). Reasons for this will be explained in the
discussion.
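A one-way ANOVA of this kind can be sketched as follows; the rating vectors are randomly generated stand-ins, not the survey data:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Hypothetical survival ratings of female passengers, split by the
# participant's own gender (all numbers are made up for illustration;
# group sizes mirror the sample: 55 women, 67 men).
ratings_by_women = rng.normal(0.70, 0.10, 55)
ratings_by_men = rng.normal(0.62, 0.10, 67)

# One-way ANOVA: does the participant's gender shift the ratings?
f_stat, p_value = f_oneway(ratings_by_women, ratings_by_men)
```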
4.4: Interpretation
Having performed the statistical tests and accepted or rejected the hypotheses formulated earlier,
we can draw some conclusions about the influence of the LIME-explanatory mechanism on the
decision-making process.
First, we can conclude that having a LIME-explanator has a significant positive effect on the
interpretability of the neural network: participants who received a LIME explanation rated the
machine learning model as significantly more interpretable than participants who did not.
Second, holding all other variables constant, we can conclude that a LIME-explanator has a positive
effect on the trust in the model. In the literature overview (chapter two, paragraph three) we already
gathered some explanations of why this would happen, and after collecting the data and applying
statistical methods we can conclude that providing a LIME explanation mechanism has a significant
positive effect on the trust one places in the machine learning model.
Third, a central question of this research was whether providing a LIME explanation increases the
accuracy of the human agent in making decisions and predictions. After analysis we can conclude
that it does. We do not know for certain why there is a positive effect, but we can conclude with a
fairly high degree of confidence that having a LIME-explanator makes participants more accurate
than participants without one. It should be noted that the control group, even though it did not
receive a LIME-explanator, did receive the same machine learning model with the same level of
accuracy.
Lastly, we concluded that women are slightly more accurate than men. We also found that women
are more optimistic than men in predicting the survival of their own gender. In the next chapter we
describe why we do not think this conclusion can be generalized to other cases; we will therefore
refrain from interpreting these results further.
4.5: Conclusion
After gathering and analyzing the data, we conclude that providing a LIME-explanator positively
influences the trust a participant places in the machine learning model. We also concluded that the
LIME-explanator significantly increases interpretability and makes the participants significantly more
accurate. In the next chapter we look at the extent to which these findings can be generalized.
Chapter 5: Discussion, limitations and recommendations
In the previous chapters, hypotheses were formulated and tested; some were accepted and others
rejected. In this chapter we discuss the findings, the limitations of the research performed, and the
conclusions that can be drawn from it, for managers as well as academics.
5.1: Reflective discussion on the results
During this research we investigated the influence of an explanation on trust as well as accuracy. As
we saw in the previous chapter, the amount of trust an average participant places in the machine
learning model's predictions increases significantly when an explanation is provided. Participants
also become more accurate in predicting the survival of passengers when they receive an
explanation. The exact cause of the increase in accuracy remains uncertain; it can be explained in
two different ways, both of which circle back to the literature overview of chapter two.
1) The human-replacement explanation: an increase in trust may cause participants to rely more on
the machine learning model than on their own intuition. Since algorithms are in general more
accurate than human judgement (Dawes, 1979), this could cause the increase in accuracy.
2) The different-viewpoint-explanation: participants that received the LIME-explanatory mechanism
viewed their model as significantly more interpretable. They understood why the model predicted a
certain outcome, and this understanding gave them an opportunity to overrule the model. If the
neural network predicted that Paul Chevre would survive because he traveled first-class, a
participant may have had an extra piece of information: perhaps he or she has visited the grave
of Paul Chevre, or has seen a documentary in which he offered up his seat to a child. Without the
LIME-explanation it would be unclear what information was taken into account while making the
prediction.
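The different-viewpoint explanation hinges on participants seeing *which* features drove a prediction. The recipe behind LIME is small enough to sketch: perturb the instance, weight the perturbed samples by their proximity to the instance, and fit a weighted linear surrogate whose coefficients serve as the explanation. The sketch below uses a hypothetical linear stand-in for the trained network; the `black_box` function and its coefficients are illustrative assumptions, not the thesis's actual model.

```python
import math
import random

# Hypothetical stand-in for the trained neural network: predicted survival
# probability from two binary features. The coefficients are illustrative
# assumptions, NOT the thesis's fitted model.
def black_box(sex_female, first_class):
    return 0.15 + 0.55 * sex_female + 0.25 * first_class

def solve3(A, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [row[:] + [v] for row, v in zip(A, b)]
    for c in range(3):
        p = max(range(c, 3), key=lambda r: abs(M[r][c]))  # partial pivot
        M[c], M[p] = M[p], M[c]
        for r in range(3):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * e for a, e in zip(M[r], M[c])]
    return [M[i][3] / M[i][i] for i in range(3)]

def lime_sketch(instance, model, n_samples=500, kernel_width=0.75):
    """Explain one prediction LIME-style: sample binary perturbations around
    the instance, weight them with an exponential proximity kernel, and fit
    a weighted least-squares linear surrogate via the normal equations."""
    random.seed(42)
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for _ in range(n_samples):
        z = [random.randint(0, 1) for _ in instance]
        dist = sum(a != v for a, v in zip(z, instance)) / len(instance)
        w = math.exp(-(dist ** 2) / kernel_width ** 2)
        x = [1.0, float(z[0]), float(z[1])]   # intercept + the two features
        y = model(*z)
        for i in range(3):                    # accumulate X'WX and X'Wy
            b[i] += w * x[i] * y
            for j in range(3):
                A[i][j] += w * x[i] * x[j]
    return solve3(A, b)  # [intercept, weight(sex_female), weight(first_class)]

# Explain the prediction for a first-class female passenger.
intercept, w_female, w_first = lime_sketch([1, 1], black_box)
```

The surrogate's weights are the kind of information the treatment group saw next to each prediction; with a real trained model one would use the `lime` package's `LimeTabularExplainer` rather than this hand-rolled version.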
In the literature overview we summarized several reasons why it might be preferable to have a
human make the final decision: a human can be held accountable, cannot 'cheat', and understands
the difference between a proxy goal and the end goal (Doshi-Velez & Kim, 2017). The Human
Replacement-explanation takes a dim view of this perspective. If a human becomes more accurate
by relying more on the model, can he become more accurate still by relying on the model entirely?
And if he does, what use is the human decision-maker? If the different-viewpoint explanation is
true, companies should do the opposite and let their most experienced employees work with the
machine learning models, since those employees are the reason for the superior performance.
The clash between these two ideas shows the importance of further research in the field of
interpretable machine learning. From the gathered data one might get the impression that knowledge
about the Titanic has a negative influence on accuracy, since the coefficient was negative (-0.076)
and significant. It should be kept in mind, however, that having read a book or seen a documentary
about the Titanic hardly provides the detailed knowledge needed for this survey. On the other hand,
the research showed that human agents do become more accurate when using an interpretable machine
learning model. This is confirmatory evidence for the previously mentioned strand of literature
stating that humans can benefit from working together with machine learning models (Zuo, 2019;
Doshi-Velez & Kim, 2017). More research is needed to provide a definitive answer on where this
increased accuracy comes from.
A second finding to discuss is that female participants rate female passengers on the Titanic as
more likely to survive than male participants do, whereas male participants do not rate male
passengers as more likely to survive than female participants do. Two psychological causes can
explain this phenomenon: the availability bias and anchoring.
The availability bias is the mental tendency to equate the availability of a memory with the
probability of the event happening (Kidd et al., 1983). A popular example is that we estimate death
by a terrorist attack to be more likely than death by a car accident, even though the latter is a
thousand times more likely ("What do Americans fear?", 2020). A terrorist attack is very memorable
and will therefore be very available. The availability bias can explain why women see other women
as more likely to survive. In the popular 1997 movie about the Titanic by James Cameron the male
protagonist dies and the female counterpart survives. Since the movie is more popular among women,
it is likely that they are more affected by this availability bias than men (Todd, 2014).
A second explanation is the anchoring bias: when placed in a new situation, the first data points
and decisions guide us through the rest of the process (Aronson et al., 2017). In both versions of
the survey a female passenger is introduced first. In the non-LIME variant it is Nelle Stevenson
with a predicted survival rate of 98%; in the LIME variant it is Carrie Toogood with a survival
rate of 97%. Both observations are female with a very high chance of survival. It is possible that
participants used these observations as an anchor that women are likely to survive, and that this
effect spilled over into the rest of the decision process. If so, the effect should disappear if a
replication study uses a more neutral first question.
5.2: Limitations
During the research process certain decisions were made that limit the findings and the
generalizability of the conclusions drawn in the previous chapter. For clarity we have grouped
these limitations into two categories: limitations regarding research design and other limitations.
First of all, participants were asked a closed question with only two options: yes or no. These
types of decisions regularly occur in real life (Fischhoff, 1996; Nutt, 1993), but more open-ended
decisions also occur and are not represented in the survey. Whether the influence of an explanatory
mechanism generalizes to open-ended decisions is uncertain at best.
Second, the research had a high drop-out rate: 23 of the 145 participants dropped out, almost 16%.
It is likely that this drop-out is random, since the control group and the treatment group are of
the same size; however, we cannot know for sure, and the drop-out rate should be kept in mind if
the experiment is replicated. The high drop-out rate also resulted in a final sample of only 122
respondents, well below the set target of 150. While the hypotheses could still be tested with a
high level of significance, a replication with a larger sample size is preferable.
While the size of the sample is one concern, its lack of diversity is another. As stated in chapter
four, the sample is mainly Dutch and young and does not represent the population very well.
Differences in age might therefore not be adequately captured.
Third is the generalizability of the conclusions to other types of machine learning models or
explanation mechanisms. During the research we chose a neural network because it is opaque and
complicated. Had a decision tree, another form of machine learning, been used as the model, the
results could have differed. The neural network used was also simple in setup; as more complicated
neural networks capture deeper patterns, it remains the question whether a LIME-explanator can
capture this increased complexity in its recommendations. It is hard to make a prediction on this,
hence more research is needed.
Lastly, participants were asked to make predictions about a sample of Titanic passengers for which
the most relevant parameters and the distribution of the survival rate per parameter were known.
This may not be applicable to all scenarios. In predicting the success of a start-up, for instance,
not all relevant parameters are known, nor the distribution of success per parameter. Whether an
explanatory mechanism helps in such a scenario depends on the causal mechanism described earlier.
5.3: Academic and managerial conclusions
Interpretable machine learning offers opportunities for companies in ways we have yet to fully
learn. In general we can conclude from this research that it is advisable for companies that
already use machine learning to make their algorithms interpretable for their employees and
customers. Interpretable machine learning models would make employees more capable decision-makers
and help prevent mistakes that computers simply miss. Second, making machine learning models more
interpretable creates more trust between the customer and the company than an uninterpretable
model, which can evoke negative feelings. Lastly, new EU regulation requires companies to provide
an explanation to their customers upon request anyway; complying with this legislation is without
doubt a high priority for most firms.
For academics, more research is needed on the influence of diverse machine learning methods and
explanatory mechanisms on human accuracy. In the research executed we used one machine learning
method (a neural network) and only one explanatory mechanism. It could very well be that other
machine learning methods combined with other explanatory mechanisms would yield results that
deviate from the findings presented here, and that the conclusions drawn here therefore cannot be
generalized. Secondly, the effectiveness of interpretable machine learning decision support is
relatively untested in more unpredictable environments where not all the facts nor all the
important variables are known. During the experiment the number of outcomes was fairly restricted
(survived/did not survive) and the relevant variables on which to base a decision were known. Many
decisions of great everyday importance are not as structured. While their importance is great,
little research has been done here, and this provides a rare opportunity.
5.4: Conclusion
After the research a fundamental question remains: do humans and algorithms perform better together
or in isolation? Even though we have not answered this question definitively, we have provided an
answer on how they can work together better. We have also shown why same-gender optimism may have
been a design flaw of the experiment. We have limited our conclusions to fit the experiment
performed, and we have concluded with some general recommendations for managers and ideas for
future research.
References
Al-Qaheri, H., & Hasan, M. (2010). An End-User Decision Support System for Portfolio Selection: A
Goal Programming Approach with an Application to Kuwait Stock Exchange (KSE). International Journal
of Computer Information Systems and Industrial Management Applications.
Alrefaei, M., & Andradóttir, S. (2005). Discrete stochastic optimization using variants of the stochastic
ruler method. Naval Research Logistics (NRL), 52(4), 344-360. https://doi.org/10.1002/nav.20080
Aronson, E., Wilson, T., Fehr, B., & Sommers, S. (2017). Social psychology (9th ed.). Pearson.
B. Kim, J. Shah, F. Doshi-Velez (2015). Mind the gap: A generative approach to interpretable feature
selection and extraction. NIPS
Bank of England. (2019). Machine learning in UK financial services. London: Bank of England.
Retrieved from https://www.bankofengland.co.uk/-/media/boe/files/report/2019/machine-learning-
in-uk-financial-services.pdf
Biran, O., & Cotton, C. (2017). Explanation and justification in machine learning: A survey.
Workshop on Explainable Artificial Intelligence (XAI), pp. 8-13.
Bode, J. (1998). Decision support with neural networks in the management of research and
development: Concepts and application to cost estimation. Information & Management, 34(1), 33-
40. doi: 10.1016/s0378-7206(98)00043-3
Bostrom, Nick; Yudkowsky, Eliezer (2011). "The Ethics of Artificial Intelligence" (PDF). Cambridge
Handbook of Artificial Intelligence. Cambridge Press.
C. Yang, A. Rangarajan and S. Ranka, Global model interpretation via recursive partitioning. In IEEE
20th International Conference on High Performance Computing and Communications; IEEE 16th
International Conference on Smart City; IEEE 4th International Conference on Data Science and
Systems (HPCC/SmartCity/DSS), pp. 1563-1570, 2018, June.
Cauffman, E., Shulman, E., Steinberg, L., Claus, E., Banich, M., Graham, S., & Woolard, J. (2010). Age
differences in affective decision making as indexed by performance on the Iowa Gambling Task.
Developmental Psychology, 46(1), 193-207. https://doi.org/10.1037/a0016128
Chapman, B., Duberstein, P., Sörensen, S., & Lyness, J. (2007). Gender differences in Five Factor
Model personality traits in an elderly cohort. Personality And Individual Differences, 43(6), 1594-
1603. https://doi.org/10.1016/j.paid.2007.04.028
Chatterjee, T. (2018). Prediction of Survivors in Titanic Dataset: A Comparative Study using Machine
Learning Algorithms. International Journal Of Emerging Research In Management And Technology,
6(6), 1. https://doi.org/10.23956/ijermt.v6i6.236
Cialdini, R. (2014). Influence: science and practise (6th ed.). Harlow, Essex: Pearson.
Cio Summits. (2019). What consumers really think about AI (p. 2). Cio Summits.
COMPLEXITY | meaning in the Cambridge English Dictionary. (2020). Retrieved 19 February 2020,
from https://dictionary.cambridge.org/dictionary/english/complexity
Computer Crushes the Competition on 'Jeopardy!'. (2020). Retrieved 26 February 2020, from
https://www.cbsnews.com/news/computer-crushes-the-competition-on-jeopardy/
D Gkatzia, O Lemon, and V Rieser. (2016) Natural language generation enhances human
decisionmaking with uncertain information. In ACL
DARPA Announces $2 Billion Campaign to Develop Next Wave of AI Technologies. (2020). Retrieved
20 February 2020, from https://www.darpa.mil/news-events/2018-09-07
Dataman, D. (2020). Explain Your Model with LIME. Medium. Retrieved 15 May 2020, from
https://medium.com/analytics-vidhya/explain-your-model-with-lime-5a1a5867b423.
Dawes, R. (1979). The robust beauty of improper linear models in decision making. American
Psychologist, 34(7), 571-582. doi: 10.1037/0003-066x.34.7.571
de Bruin, W., Parker, A., & Fischhoff, B. (2010). Explaining adult age differences in decision-making
competence. Journal Of Behavioral Decision Making, 25(4), 352-360.
https://doi.org/10.1002/bdm.712
Doshi-Velez, F., Kim, B. (2017) Towards a rigorous science of interpretable machine learning.no ML:
1-13
Encyclopedia Titanica. 2020. Titanic Survivors. [online] Available at: <https://www.encyclopedia-
titanica.org/titanic-survivors/> [Accessed 5 May 2020].
Ericsson, A. (2006). The Cambridge handbook of expertise and expert performance (2nd ed.).
Cambridge University Press.
Ericsson, K. (2004). Deliberate Practice and the Acquisition and Maintenance of Expert Performance
in Medicine and Related Domains. Academic Medicine, 79(Supplement), S70-S81.
https://doi.org/10.1097/00001888-200410001-00022
Ericsson, K., & Lehmann, A. (1996). EXPERT AND EXCEPTIONAL PERFORMANCE: Evidence of Maximal
Adaptation to Task Constraints. Annual Review Of Psychology, 47(1), 273-305.
https://doi.org/10.1146/annurev.psych.47.1.273
Female Business Leaders: Global Statistics. Catalyst. (2020). Retrieved 4 June 2020, from
https://www.catalyst.org/research/women-in-management/.
Fischer, T., & Julsing, M. (2019). Onderzoek doen !. Groningen: Noordhoff Uitgevers.
Fischhoff, B. (1996). The Real World: What Good Is It?. Organizational Behavior And Human Decision
Processes, 65(3), 232-248. https://doi.org/10.1006/obhd.1996.0024
Fong, R. Veldadi, A. (2017) Interpretable Explanations of Black Boxes by Meaningful Perturbation.
Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV)
Goodfellow, I., Bengio, Y., & Courville, A. (2017). Deep learning. The MIT Press.
Greenemeier, L. (2020). 20 Years after Deep Blue: How AI Has Advanced Since Conquering Chess.
[online] Scientific American. Available at: https://www.scientificamerican.com/article/20-years-after-
deep-blue-how-ai-has-advanced-since-conquering-chess/ [Accessed 26 Feb. 2020].
Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., & Pedreschi, D. (2018). A Survey of
Methods for Explaining Black Box Models. ACM Computing Surveys, 51(5), 1-42. doi:
10.1145/3236009
Halford, G., Baker, R., McCredden, J., & Bain, J. (2005). How Many Variables Can Humans Process?.
Psychological Science, 16(1), 70-76. doi: 10.1111/j.0956-7976.2005.00782.x
Hao, K. (2019). We analyzed 16,625 papers to figure out where AI is headed next. Retrieved 4 March
2020, from https://www.technologyreview.com/s/612768/we-analyzed-16625-papers-to-figure-out-
where-ai-is-headed-next/
Herlocker, J. L., Konstan, J. A., & Riedl, J. (2000). Explaining collaborative filtering
recommendations. Conference on Computer Supported Cooperative Work (CSCW).
Hinson, J., Jameson, T., & Whitney, P. (2003). Impulsive decision making and working memory.
Journal Of Experimental Psychology: Learning, Memory, And Cognition, 29(2), 298-306. doi:
10.1037/0278-7393.29.2.298
Hodjat, B. (2015). The AI Resurgence: Why Now?. Retrieved 4 March 2020, from
https://www.wired.com/insights/2015/03/ai-resurgence-now/
Hu, X., Niu, P., Wang, J., & Zhang, X. (2019). A Dynamic Rectified Linear Activation Units. IEEE Access,
7, 180409-180416. https://doi.org/10.1109/access.2019.2959036
Jiang, X., Pang, Y., Li, X., Pan, J., & Xie, Y. (2018). Deep neural networks with Elastic Rectified Linear
Units for object recognition. Neurocomputing, 275, 1132-1139.
https://doi.org/10.1016/j.neucom.2017.09.056
Kidd, J., Kahneman, D., Slovic, P., & Tversky, A. (1983). Judgment under Uncertainty: Heuristics
and Biases. The Journal Of The Operational Research Society, 34(3), 254.
https://doi.org/10.2307/2581328
Kingma, D.P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv.
https://arxiv.org/abs/1412.6980
Kleinberg, J., Lakkaraju, H., Leskovec, J., Ludwig, J., & Mullainathan, S. (2017). Human Decisions and
Machine Predictions*. The Quarterly Journal Of Economics. doi: 10.1093/qje/qjx032
L. Xu, K. Crammer, D. Schuurmans. (2006) Robust support vector machine training via convex outlier
ablation. AAAI
Langley, P., & Simon, H. (1995). Applications of machine learning and rule induction. Communications
Of The ACM, 38(11), 54-64. doi: 10.1145/219717.219768
Letzter, R. (2020). Amazon just showed us that 'unbiased' algorithms can be inadvertently racist.
Retrieved 26 February 2020, from https://www.businessinsider.com/how-algorithms-can-be-racist-
2016-4?international=true&r=US&IR=T
M. Ancona, C. Öztireli and Gross, “Explaining Deep Neural Networks with a Polynomial Time
Algorithm for Shapley Values Approximation,” In ICML, 2019
M. Bilgic, R.J. Mooney (2005). Explaining recommendations: satisfaction vs promotion. Workshop on
the next stage of recommender systems research.
M.T. Ribeiro, S. Singh, C. Guestrin. (2016). “Why should I trust you?” Explaining the predictions of any
classifier. Proceedings of the ACM SIGKDD international conference on knowledge discovery and data
mining.
Mark or unmark Spam in Gmail - Computer - Gmail Help. (2020). Retrieved 14 February 2020, from
https://support.google.com/mail/answer/1366858?co=GENIE.Platform%3DDesktop&hl=en
Mary T. Dzindolet, Scott A. Peterson, Regina A. Pomranky, Linda G. Pierce, Hall P. Beck (2003) The
role of trust in automation reliance. International Journal of Human-Computer studies.
Matthias, A. (2004). The responsibility gap: Ascribing responsibility for the actions of learning
automata. Ethics And Information Technology, 6(3), 175-183. doi: 10.1007/s10676-004-3422-1
Miller, T. (2019). Explanation in artificial intelligence: Insights from the social sciences. Artificial
Intelligence, 267, 1-38. doi: 10.1016/j.artint.2018.07.007
Mueller, J., & Massaron, L. (2016). Machine Learning For Dummies. For Dummies.
Nair, S. (2009). Marketing research. Himalaya Pub. House.
Netzer, O., Lemaire, A., & Herzenstein, M. (2016). When Words Sweat: Identifying Signals for Loan
Default in the Text of Loan Applications. SSRN Electronic Journal. doi: 10.2139/ssrn.2865327
Nickerson, R. (1998). Confirmation Bias: A Ubiquitous Phenomenon in Many Guises. Review Of
General Psychology, 2(2), 175-220. doi: 10.1037/1089-2680.2.2.175
Nutt, P. (1993). The Identification of Solution Ideas During Organizational Decision Making.
Management Science, 39(9), 1071-1085. https://doi.org/10.1287/mnsc.39.9.1071
Official Journal of the European Union. Regulation (EU) 2016/679 of the European Parliament and of
the Council of 27 April 2016 on the protection of natural persons with regard to the processing of
personal data and on the free movement of such data, and repealing Directive 95/46/EC (General
Data Protection Regulation) (2016).
Or Biran and Kathleen McKeown.(2017) Human-centric justification of machine learning predictions.
In IJCAI, Melbourne, Australia.
Papenmeier, Englebienne, Seifert. (2019) How model accuracy and explanation fidelity influence user
trust in AI. arXiv(2019)
Powell, M., & Ansic, D. (1997). Gender differences in risk behaviour in financial decision-making: An
experimental analysis. Journal Of Economic Psychology, 18(6), 605-628.
https://doi.org/10.1016/s0167-4870(97)00026-3
R. Sinha, K. Swearingen (2002). The role of transparency in recommender systems. CHI EA.
R. Zhang, T.J. Brennan, A.W. Lo (2014). The origin of risk aversion. Proceedings of the National
Academy of Sciences of the United States of America.
Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use
interpretable models instead. Nature Machine Intelligence, 1(5), 206-215. doi: 10.1038/s42256-019-
0048-x
S. Carton, J. Helsby, K. Joseph, A. Mahmud, Y. Park, J. Walsh, C. Cody, E. Patterson, L. Haynes, R.
Ghani. (2016) Identifying police officers at risk of adverse events. 22nd ACM SIGKDD international
conference.
Saaty, T., & Ozdemir, M. (2003). Why the magic number seven plus or minus two. Mathematical And
Computer Modelling, 38(3-4), 233-244. doi: 10.1016/s0895-7177(03)90083-5
Shahid, N., Rappon, T., & Berta, W. (2019). Applications of artificial neural networks in health care
organizational decision-making: A scoping review. PLOS ONE, 14(2), e0212356. doi:
10.1371/journal.pone.0212356
Sundarajan, M., Taly, A., Yan, Q. (2017). Axiomatic Attribution for Deep Networks. Proceedings of the
34th International Conference on Machine Learning, volume 70, 3319-3328.
P. Symeonidis, Y. Manolopoulos (2012). A generalized taxonomy of explanation styles for traditional
and social recommender systems. Data Mining and Knowledge Discovery, 24(3), 555-583.
Taylor, R. (1975). Age and Experience as Determinants of Managerial Information Processing and
Decision Making Performance. Academy Of Management Journal, 18(1), 74-81.
https://doi.org/10.5465/255626
Techcrunch.com. (2017). Google is acquiring data science community Kaggle. [online] Available at:
<https://techcrunch.com/2017/03/07/google-is-acquiring-data-science-community-kaggle/> [Accessed 5
May 2020].
Thiel, P., & Masters, B. (2015). Zero to one. [United States]: Bokish Ltd.
Thimm, G., & Fiesler, E. (1997). High-order and multilayer perceptron initialization. IEEE Transactions
On Neural Networks, 8(2), 349-359. https://doi.org/10.1109/72.557673
Todd, E. (2014). Passionate Love and Popular Cinema. Palgrave Macmillan.
Trejos, N. (2007). Existing-Home Sales Fall Steeply. Retrieved 4 March 2020, from
https://www.washingtonpost.com/wp-dyn/content/article/2007/04/24/AR2007042400627.html
Universal Declaration of Human Rights. (2020). Retrieved 26 February 2020, from
https://www.un.org/en/universal-declaration-human-rights/
Wernicke, S. (2015). How to use data to make a hit TV show.
What do Americans fear?. ScienceDaily. (2020). Retrieved 28 May 2020, from
https://www.sciencedaily.com/releases/2016/10/161012160030.htm.
When Computers Decide: European Recommendations on Machine learning automated decision
making. (2020). Retrieved 14 February 2020, from
https://www.acm.org/binaries/content/assets/public-policy/ie-euacm-adm-report-2018.pdf
Women in Parliaments: World and Regional Averages. Archive.ipu.org. (2020). Retrieved 4 June 2020,
from http://archive.ipu.org/wmn-e/world.htm.
X. Zhang, J. Zhao, Y. LeCun (2016). Character-level Convolutional Networks for text classification.
arXiv.
Zimbardo, P., Johnson, R., & McCann, V. (2016). Psychology: Core concepts (8th ed.). Amsterdam:
Pearson Benelux.
Zion market research. (2017). Machine Learning Market by Service (Professional Services, and
Managed Services), for BFSI, Healthcare and Life Science, Retail, Telecommunication, Government
and Defense, Manufacturing, Energy and Utilities, Others: Global Industry Perspective,
Comprehensive Analysis, and Forecast, 2017-2024. Zion market research.
Zuo, Y. (2019). Research and implementation of human-autonomous devices for sports training
management decision making based on wavelet neural network. Journal Of Ambient Intelligence And
Humanized Computing. doi: 10.1007/s12652-019-01511-y
THE INFLUENCE OF INTERPRETABLE MACHINE LEARNING ON HUMAN
ACCURACY
By: R.D. Sturm
THEORETICAL AREA OF FOCUS
• Decision making
• Machine learning
• Machine learning decision support
“There are very few examples of people
outperforming algorithms in making
predictive judgments. So when there’s the
possibility of using an algorithm, people
should use it” D. Kahneman
CONCEPTUAL FRAMEWORK
• Complex algorithms are on average more accurate than
human agents (Dawes, 1979)
• Providing an explanation increases acceptance and trust
(Herlocker, Konstan, Riedl, 2000; Dzindolet et al, 2003)
• Low trust will cause low acceptance due to “silly mistakes”
(Guidotti et al, 2018)
CONCEPTUAL MODEL
• The more complex a model is, the harder it is to
understand
• But the easier it is to understand the model, the higher the
accuracy of the human agent
• An explanatory mechanism increases understandability
• Secondly, the explanatory mechanism increases trust, which
in turn increases accuracy
RESEARCH DESIGN
• Development of a neural network with a LIME explanatory
mechanism
• Experimental survey where participants try to estimate
survival-rates of passengers on the Titanic
• Two groups: treatment-group and a control-group
• Analysis of the data
MODEL DEVELOPMENT AND SURVEY
• Dataset: Titanic passengers, 891 observations.
• Simple neural network with three hidden layers
• LIME-explanatory mechanism
• Survey collected with Qualtrics, 122 respondents
ANALYSIS
H1: When a machine learning model contains an explanatory
mechanism the human agent increases in accuracy.
Estimate: 0.1037, p = 0.0003
H2: When a machine learning model contains an explanatory
mechanism the human agent sees the machine learning model as more
trustworthy
Estimate = 0.8955, p = 8.8e-08
H3: When a machine learning model contains an explanatory
mechanism the human agent sees the machine learning model as more
interpretable.
Estimate = 0.3966, p = 0.0003
H4: the age of the human agent influences the causal link between
machine learning support and human agent accuracy negatively.
Estimate = -0.0008, p = 0.4748 (REJECTED)
H5: the gender of the human agent influences the causal link between
machine learning support and human agent accuracy.
Estimate = -0.0643, p = 0.0329
DISCUSSION
• Empirical evidence of improved performance
• Exact cause improved performance unclear
• Sample size was small and homogenous
• Classification task not representative for all decisions
IMPLICATION
• Practical: systematic decisions can be aided by a machine
learning model
• Theoretical: interpretation techniques can measure
effectiveness on increased accuracy
QUESTIONS