
On Human Predictions with Explanations and Predictions of Machine Learning Models: A Case Study on Deception Detection

Vivian Lai, University of Colorado Boulder, [email protected]
Chenhao Tan, University of Colorado Boulder, [email protected]

ABSTRACT
Humans are the final decision makers in critical tasks that involve ethical and legal concerns, ranging from recidivism prediction, to medical diagnosis, to fighting against fake news. Although machine learning models can sometimes achieve impressive performance in these tasks, these tasks are not amenable to full automation. To realize the potential of machine learning for improving human decisions, it is important to understand how assistance from machine learning models affects human performance and human agency.

In this paper, we use deception detection as a testbed and investigate how we can harness explanations and predictions of machine learning models to improve human performance while retaining human agency. We propose a spectrum between full human agency and full automation, and develop varying levels of machine assistance along the spectrum that gradually increase the influence of machine predictions. We find that without showing predicted labels, explanations alone slightly improve human performance in the end task. In comparison, human performance is greatly improved by showing predicted labels (>20% relative improvement) and can be further improved by explicitly suggesting strong machine performance. Interestingly, when predicted labels are shown, explanations of machine predictions induce a similar level of accuracy as an explicit statement of strong machine performance. Our results demonstrate a tradeoff between human performance and human agency and show that explanations of machine predictions can moderate this tradeoff.

CCS CONCEPTS
• Applied computing → Law, social and behavioral sciences.

KEYWORDS
human agency, human performance, explanations, predictions

ACM Reference Format:
Vivian Lai and Chenhao Tan. 2019. On Human Predictions with Explanations and Predictions of Machine Learning Models: A Case Study on Deception Detection. In FAT* '19: Conference on Fairness, Accountability, and Transparency, January 29–31, 2019, Atlanta, GA, USA. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3287560.3287590

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
FAT* '19, January 29–31, 2019, Atlanta, GA, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6125-5/19/01...$15.00
https://doi.org/10.1145/3287560.3287590

1 INTRODUCTION
Machine learning has achieved impressive success in a wide variety of tasks. For instance, neural networks have surpassed human-level performance in ImageNet classification (95.06% vs. 94.9%) [29]; Kleinberg et al. [36] demonstrate that in bail decisions, machine predictions of recidivism can reduce jail rates by 41.9% with no increase in crime rates, compared to human judges; Ott et al. [60] show that linear classifiers can achieve ∼90% accuracy in detecting deceptive reviews while humans perform no better than chance. As a result of these achievements, machine learning holds promise for addressing important societal challenges.

However, it is important to recognize different roles that machine learning can play in different tasks in the context of human decision making. In tasks such as object recognition, human performance can be considered as the upper bound, and machine learning models are designed to emulate the human ability to recognize objects in an image. A high accuracy in such tasks presents great opportunities for large-scale automation and consequently improving our society's efficiency. In contrast, efficiency is a lesser concern in tasks such as bail decisions. In fact, full automation is often not desired in these tasks due to ethical and legal concerns. These tasks are challenging for humans and for machines, but with vast amounts of data, machines can sometimes identify patterns that are unsalient, unknown, or counterintuitive to humans. If the patterns embedded in the machine learning models can be elucidated for humans, they can provide valuable support when humans make decisions.

The goal of our work is to investigate best practices for integrating machine learning into human decision making. We propose a spectrum between full human agency, where humans make decisions entirely on their own, and full automation, where machines make decisions without human intervention (see Figure 1 for an illustration). We then develop varying levels of machine assistance along the spectrum using explanations and predictions of machine learning models. We build on recent developments in interpretable machine learning that provide useful frameworks for generating explanations of machine predictions [34, 35, 45, 50, 64, 65]. Instead of using these explanations to help users debug machine learning models, we incorporate the explanations as assistance for humans to improve human performance while retaining human agency in the decision making process. Accordingly, we directly evaluate human performance in the end task through user studies.

In this work, we focus on a constrained form of decision making where humans make individual predictions. Specifically, we ask humans to decide whether a hotel review is genuine or deceptive based on the text. This prediction problem allows us to focus on the integration of machine learning into human predictions. In comparison, prior work in decision theory and decision support systems focuses on modeling preferences and utilities as well as building knowledge databases and representations to reason about complex decisions [5, 31, 33, 55, 67]. Moreover, since many policy decisions can be formulated as prediction problems [37], understanding human predictions with assistance from machine learning models constitutes an important step towards empowering humans with machine learning in critical challenging tasks.

Figure 1: A spectrum between full human agency and full automation illustrating how machine learning can be integrated in human decision making. The detailed explanation of each method is in Section 3.

Deception detection as a testbed. In this work, we use deception detection as our testbed for three reasons. First, deceptive information is prevalent on the Internet. For instance, Ott et al. [58] find that deceptive reviews are a growing problem on multiple platforms such as TripAdvisor and Yelp. Fake news has also received significant attention recently [43, 74] and might have influenced the outcome of the U.S. presidential election in 2016 [3]. Enhancing humans' ability in detecting deception can potentially alleviate these issues.

Second, deception detection is a challenging task for humans and has been extensively studied [1, 2, 22, 24, 60]. It is promising that machines show preliminary success in prior work. For example, machines are able to achieve an accuracy of ∼90% in distinguishing genuine reviews from deceptive ones, while human performance is no better than chance [60]. Machines can identify unsalient and counterintuitive signals, e.g., deceptive reviews are less specific about spatial configurations and tend to include less sensorial and concrete language. It is worth noting that we should take the high machine accuracy with a grain of salt in the general domain because deception detection is a complex problem.1 The task introduced by Ott et al. [60] nevertheless provides an ideal sandbox to understand human predictions with assistance from machine learning models.

Third, full automation is not desired in critical tasks such as deception detection because of ethical and legal concerns. The government should not have the authority to automatically block information from individuals, e.g., in the context of "fake news". Furthermore, full automation may not comply with legal requirements. For instance, in the case of recidivism prediction, the Wisconsin Supreme Court ruled that "judges be made aware of the limitations of risk assessment tools" and "a COMPAS risk assessment should not be used to determine the severity of a sentence or whether an offender is incarcerated" [47, 71]. Similarly, the trial judge is required to act as a gatekeeper regarding the evidence from a polygraph (lie detector) [70]. Therefore, it is crucial to retain human agency and understand human predictions with assistance from machine learning models.

1 For instance, one can argue that it is impossible to fully address the issue of deception in online reviews only based on textual information, as an adversarial user can copy another user's review, which becomes a deceptive review but with exactly the same text as a genuine one.

Organization and Highlights. We start by reviewing related work to provide the necessary background for our study (Section 2). Our focus in this work is on investigating human predictions with assistance from machine learning models in the context of deceptive review detection. To explore the spectrum between full human agency and full automation in Figure 1, we develop varying levels of assistance from machine learning models (Section 3). For example, the following three levels of machine assistance gradually increase the influence of machine predictions: 1) showing only explanations of machine predictions without revealing predicted labels; 2) showing predicted labels without revealing high machine accuracy; 3) showing predicted labels with an explicit statement of strong machine accuracy.

In Section 4, we investigate human performance under different experimental setups along the spectrum. We show that explanations alone slightly improve human performance, while showing predicted labels achieves great improvement (∼21% relative improvement in human accuracy). However, this improvement is still moderate compared to "full" priming with an explicit statement of machine accuracy (∼46% relative improvement in human accuracy). Our findings suggest that there exists a tradeoff between human performance and human agency. Interestingly, when predicted labels are shown, explanations of machine predictions can achieve a similar effect as an explicit statement of machine accuracy. We also find that humans tend to trust correct machine predictions more than incorrect ones, indicating that they can somewhat identify when machines are correct.

We further examine the effect of statements of machine accuracy by varying the accuracy numbers (Section 5). Surprisingly, we find that our participants are not sensitive to statements of machine accuracy and are more likely to trust machine predictions with an accuracy statement than without, even if the accuracy statement suggests poor machine performance. These observations echo prior work on numeracy and suggest that it is difficult for humans to interpret and act on numbers [6, 62, 63, 69]. We also find that frequency explanations (e.g., 5 out of 10 for explaining 50%) can help humans calibrate the accuracy numbers. Note that we do not recommend these presentations on the spectrum because they present untruthful information.

We discuss the limitations of our work and provide concludingthoughts regarding future directions of investigating best practicesfor integrating artificial intelligence into human decision makingin Section 6.


2 RELATED WORK
We summarize related work in two areas to put our work in context: interpretable machine learning, and deception and misinformation.

Interpretable machine learning. Machine learning models remain black boxes despite wide adoption. Blindly following machine predictions may lead to dire repercussions, especially in scenarios such as medical diagnosis and justice systems [9, 36, 73]. Therefore, improving their transparency and interpretability has attracted broad interest [34, 35, 45, 50, 64, 65], dating back to early work on recommendation systems [13, 30]. In the case of general automation, researchers have also studied issues of appropriate reliance and trust [8, 18, 44, 61, 76].

There are two major approaches to providing explanations of machine learning models: example-based and feature-based. For example, an example-based explanation framework is MMD-critic proposed by Kim et al. [34], which selects both prototypes and criticisms. Ribeiro et al. [64] propose a feature-based approach, LIME, that fits a sparse linear model to approximate non-linear models locally. Similarly, Lundberg and Lee [50] present a unified framework that assigns each feature an importance value for a particular prediction.

We would like to emphasize two unique aspects of our work: task difficulty and interpretability evaluation. First, compared to categorizing text into topics and object recognition, deception detection is a challenging task for humans, and it remains an open question whether humans can leverage help from machine learning models in such settings. Second, we directly measure human performance in the end task. In comparison, prior work in interpretable machine learning aims to help humans understand how machine learning models work and/or debug them; the evaluation is thus mostly based on either the understanding of the models or the improvement in machine performance. Concurrently, several recent studies have also examined how explanations relate to human performance [10, 23]. Our work also resonates with the seminal work on mixed-initiative user interfaces [31] and intelligence augmentation [4]. In addition, our work is connected to cognitive studies on understanding effective explanations beyond the context of machine learning [48, 49].

Deception and misinformation. Deception is a widely studied phenomenon in many disciplines [75]. In psychology, deception is defined as an act that is intended to foster in another person a belief or understanding which the deceiver considers false [41]. To detect deception, researchers have examined the role of behavioral, emotional, and linguistic cues [17, 19, 39, 42, 54, 75].

Since people are increasingly relying on online reviews to make purchase decisions [11, 72, 78, 81], machine learning methods have been used to detect deception in online reviews [22, 24, 32, 60, 77, 79]. An important challenge in detecting deception in online reviews is to obtain the groundtruth labels of reviews. Ott et al. [60] create the first sizable dataset in deception detection by asking Amazon Mechanical Turkers to write deceptive reviews. Deceptive reviews can also be seen as an instance of spamming and online fraud [2, 16, 27, 56].

More recently, the issue of misinformation and fake news has drawn much attention from both the public and the research community [21, 43]. Most relevant to our work is Zhang et al. [80], which explores varying types of credibility annotations specifically designed for news articles. In addition, Nyhan and Reifler [57] demonstrate the "backfire" effect, which suggests that corrections of misperceptions may enhance people's false beliefs, and Vosoughi et al. [74] show that fake news is more innovative and spreads faster than real news.

It is worth noting that deception detection is a broad and complex issue. For instance, fake news can be hard to define and may not be easily separated into two classes. Moreover, detecting fake news is different from detecting deceptive reviews, as the former task requires other skills such as fact checking. It is important to note that our focus in this work is on investigating how humans interact with assistance from machine learning algorithms in decision making. We thus adopt the task of distinguishing genuine reviews from deceptive ones based on textual information in Ott et al. [60] as a sandbox. Our results on this constrained deception detection task can potentially contribute valuable insights to future solutions of the broader issue of deception detection.

3 EXPERIMENTAL SETUP AND HYPOTHESES
Our goal is to understand whether machine predictions and their explanations can improve human performance in challenging tasks, such as deception detection, and how humans interpret assistance from machine learning models. In this section, we first present our task setup and then develop varying levels of machine assistance along the spectrum introduced in Figure 1. We finally formulate our hypotheses and define our evaluation metrics.

Experimental setup. We employ the deception detection task developed by Ott et al. [60] and evaluate human performance in this task with varying levels of machine assistance. The dataset in Ott et al. [60] includes 800 genuine and 800 deceptive hotel reviews for 20 hotels in Chicago. The genuine reviews were extracted from TripAdvisor and the deceptive ones were written by turkers. We use 80% of the reviews as training data and the remaining 20% as the heldout test set. Since the machine performance with linear SVM in Ott et al. [60] already surpasses humans (∼50%) by a wide margin and linear classifiers are generally considered more interpretable, we follow Ott et al. [59] and use linear SVM with bag-of-words features as our machine learning model. The linear SVM classifier achieves an accuracy of 87% on the heldout test set.
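The pipeline just described is a standard text-classification setup. As a rough illustration only, the sketch below (assuming scikit-learn; the variables texts and labels are hypothetical stand-ins for the Ott et al. reviews and their genuine/deceptive labels, and the exact preprocessing, hyperparameters, and split used by the authors are not given here) shows a bag-of-words representation fed to a linear SVM with a held-out 20% test set.

# Minimal sketch of the kind of bag-of-words + linear SVM pipeline described above.
# Assumes scikit-learn; `texts` and `labels` stand in for the Ott et al. reviews
# and their genuine/deceptive labels (hypothetical inputs, not the authors' code).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def train_deception_classifier(texts, labels, seed=0):
    # 80% training data, 20% held-out test set, mirroring the setup in the paper.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, random_state=seed, stratify=labels)

    vectorizer = CountVectorizer()   # bag-of-words features
    clf = LinearSVC()                # linear SVM classifier
    clf.fit(vectorizer.fit_transform(X_train), y_train)

    test_acc = accuracy_score(y_test, clf.predict(vectorizer.transform(X_test)))
    return vectorizer, clf, test_acc  # the paper reports 87% test accuracy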

Our main task in this paper is to evaluate human performance with assistance from machine learning models. To do that, we conduct a user study on Amazon Mechanical Turk. Turkers are recruited to determine whether a review in the heldout test set is genuine or deceptive. In other words, humans are asked to perform the same task as the machine on the test set. We follow a between-subject design: each turker is assigned a level of machine assistance along the spectrum (Figure 1) and labels 20 reviews after going through three training examples and correctly answering an attention-check question. To incentivize turkers to perform at their best, we provide a 40% bonus for each correct prediction in addition to the 5 cent base rate for a review. We also solicit our participants' estimation of their own performance and basic demographic information such as gender and education background through an exit survey. We only allow a turker to participate in the study once to guarantee sample independence across experimental setups. Given that there are 320 test reviews and that we collect five turker predictions for each review, each experimental setup has a total of 80 turkers (320 reviews × 5 predictions / 20 reviews per turker). Refer to the appendix for more details regarding our user study and survey questions.

Figure 2: Example interfaces with varying levels of machine assistance: (a) heatmap (without showing predicted labels), an instance of feature-based explanations; (b) predicted label with accuracy; (c) predicted label + heatmap (without accuracy). Figure 2a only presents feature-based explanations of machine predictions in the form of a heatmap. Figure 2b shows both the predicted label and an explicit statement about machine accuracy (87%). Figure 2c shows the predicted label with a heatmap, but does not present machine accuracy. We crop the "Genuine" and "Deceptive" buttons in Figures 2b and 2c to save space.

Varying levels of machine assistance. Humans are the main agents in our experiments and make final decisions; machines only provide assistance, which can be ignored if humans deem it useless. An ideal outcome is that human performance can be improved with minimal information from machine learning models so that humans retain their agency in the decision making process. To examine how humans perform under different levels of influence from machine learning models, we consider the following presentations along the spectrum in Figure 1 (we only show three interfaces in Figure 2 for space reasons; see the appendix for more).

• Control. Humans are only presented a review. This setup contains no information from machine learning models and humans have full agency.

• Feature-based explanations. Since our machine learning model is linear, we present two versions of feature-based explanations by highlighting words based on the absolute values of weight coefficients (see the sketch after this list for one way the selection could be computed). First, we highlight the top 10 words in each review with the same color (highlight). Second, we use a heatmap to show gradual changes in weight coefficients among the top 10 words. The most heavily-weighted words are highlighted in the darkest shade of blue. Soft-highlighting (heatmap) has been shown to improve visual search on targeted areas for humans [40]. Note that we do not indicate the sign of features to avoid revealing predicted labels. Humans may pay extra attention to the highlighted words and accordingly make decisions on their own. Figure 2a shows an example interface for heatmap.

• Example-based explanations. This method (examples) is inspired by example-based interpretable machine learning [34]. Humans are presented two additional reviews from the training data, one deceptive and one genuine, that are most similar to the review under consideration. This setup resonates with nearest neighbor classifiers. Humans can potentially make better decisions in this setup than in control by comparing the similarity between reviews.

Figure 3: Human accuracy with varying levels of assistance. In Figure 3a, control provides no assistance; examples, highlight, and heatmap present explanations of machine predictions alone; predicted label w/o accuracy shows predicted labels; predicted label w/ accuracy shows predicted labels and reports machine accuracy that suggests strong machine performance. It is clear that showing predicted labels is crucial for improving human accuracy. Adding an explicit statement of machine accuracy further improves human accuracy. Figure 3b further investigates the combinations of predicted labels and their explanations, and presents machine performance as a benchmark. Intriguingly, we find that adding explanations achieves a similar effect as adding an explicit statement of machine accuracy. All p-values are computed by conducting a t-test between the corresponding setup and the first experimental setup in the figure ("control" in Figure 3a and "predicted label w/o accuracy" in Figure 3b).

• Predicted label without accuracy. The above two approaches only show explanations of machine predictions, but do not reveal any information about predicted labels. The next level of priming presents the predicted label. If humans fully follow machine predictions, they will perform much better than chance, which likely represents an upper bound in this deception detection task for humans. However, humans may not trust the machine due to algorithm aversion [15].

• Predicted label with accuracy. We may further influence human decisions by explicitly suggesting that machines perform well in this task with 87% accuracy. Figure 2b shows an example for predicted label with accuracy. Note that such strong recommendations may not be desired due to ethical and legal concerns (see our discussion in the introduction).

• Combinations. Finally, we combine feature (example)-based explanations and predicted labels. Note that we do not show machine performance to avoid strong priming. Figure 2c shows an example of predicted label + heatmap without information about machine performance.
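As a concrete illustration of how the feature-based presentations above could be derived from a linear model, the sketch below (reusing the hypothetical vectorizer and classifier from the earlier sketch; an assumption about the procedure, not the authors' interface code) picks up to ten in-vocabulary words of a review with the largest absolute weight coefficients and normalizes them to shading intensities for a heatmap, discarding the sign so that the highlighting does not give away the predicted label.

def top_weighted_words(review, vectorizer, clf, k=10):
    # Return up to k words from `review` with the largest |weight|, each with a
    # 0-1 intensity for heatmap shading (1 = darkest shade). Signs are dropped
    # so the highlighting does not reveal the predicted label.
    vocab = vectorizer.vocabulary_                # word -> feature column index
    weights = clf.coef_.ravel()                   # linear SVM weight coefficients
    tokens = set(vectorizer.build_analyzer()(review))

    scored = [(w, abs(weights[vocab[w]])) for w in tokens if w in vocab]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:k]
    if not top:
        return []
    max_score = top[0][1] or 1.0                  # avoid division by zero
    return [(word, score / max_score) for word, score in top]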

Hypotheses. We formulate the following hypotheses regarding how well humans can perform with machine assistance and how often humans trust machine predictions when predicted labels are available.

• Hypothesis 1a. Feature-based explanations and example-based explanations improve human performance over control.

• Hypothesis 1b. Heatmap is more effective than highlight, as gradual changes in weight coefficients can be useful, as shown in Kneusel and Mozer [40] for visual search. Feature-based explanations are more effective than example-based explanations since the latter requires a greater cognitive load, i.e., reading two more reviews.

• Hypothesis 2. Showing predicted labels significantly improves human performance compared to feature (example)-based explanations alone. Assuming that humans trust the machine and follow its prediction, showing predicted labels can likely improve human performance because the machine accuracy is 87%. However, showing predicted labels reduces human agency, so it is important to understand the size of the performance gap and make informed design choices.

• Hypothesis 3. By combining predicted labels and feature (example)-based explanations, the trust that humans place on machine predictions increases, as it has been shown that concrete details can influence the level of trust in general automation [44].

We evaluate the above hypotheses using two metrics, accuracy and trust. Accuracy is defined as the percentage of instances correctly predicted by humans; trust is defined as the percentage of instances for which humans follow the machine prediction. Note that we can only compute trust when predicted labels are available.
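For concreteness, these two metrics could be computed from per-review records as in the minimal sketch below (the record field names are hypothetical, not from the paper).

def human_accuracy(records):
    # Fraction of reviews the human labeled correctly.
    # Each record has 'human', 'machine', and 'truth' labels.
    return sum(r["human"] == r["truth"] for r in records) / len(records)

def trust(records):
    # Fraction of reviews where the human followed the machine prediction.
    # Only meaningful in setups where predicted labels were shown.
    return sum(r["human"] == r["machine"] for r in records) / len(records)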

4 RESULTS
In this section, we investigate how varying levels of assistance from machine learning models along the spectrum in Figure 1 affect human predictions. We first discuss aggregate human performance using human accuracy and trust. Our results show that in this challenging task, explanations alone slightly improve human performance, while showing predicted labels can significantly improve human performance. When predicted labels are shown, we examine the level of trust that humans place on machine predictions. Our results suggest that humans can somewhat differentiate correct machine predictions from incorrect ones. Finally, we present individual differences among our participants based on information collected in the exit survey. Our dataset and demonstration are available at https://deception.machineintheloop.com/.

4.1 Human Accuracy
We first present human accuracy, measured by the percentage of instances correctly predicted by humans. Our results suggest that showing predicted labels is crucial for improving human performance. Feature-based explanations coupled with predicted labels are able to induce similar human performance as an explicit statement of strong machine accuracy. As such, adding feature-based explanations to predicted labels may be more ideal than suggesting strong machine performance, as the priming is weaker and may facilitate a higher level of human agency in decision making.

Figure 4: The trust that humans place on machine predictions. Figure 4a shows that adding feature-based explanations (heatmap) can effectively increase the trust level compared to predicted label w/o accuracy. The p-values in Figure 4a are computed by conducting a t-test between the corresponding setup and predicted label w/o accuracy. Figure 4b breaks down the trust based on whether machine predictions are correct or incorrect and shows that humans trust correct machine predictions more than incorrect ones in all five experimental setups, although the differences are only statistically significant in two setups.

Explanations alone slightly improve human performance (Figure 3a). As Figure 3a shows, human performance in control is no better than chance (51.1%). This finding is consistent with Ott et al. [60] and decades of research on deception detection [7]. Explanations alone slightly improve human performance over control, and the differences are statistically significant for highlight and heatmap, but not for examples. However, the best explanation, heatmap, is not statistically significantly different from highlight (p = 0.335) or examples (p = 0.069). As a result, our findings partially support Hypothesis 1a and reject Hypothesis 1b.

These findings suggest that it is difficult for humans to understand explanations on their own. This is plausible for example-based explanations since they require an extra cognitive burden and estimating text similarity is a nontrivial task for humans. For feature-based explanations, it seems that the improvement is driven by the small number of training reviews that we provide to explain the task. First-person singular pronouns provide a good example: one of the training reviews is deceptive and highlights many occurrences of the word "my". A participant said, "I tried to match the pattern from the example. In the example. the review with the most "My's" and "I's" were deceptive". In other words, the improvement in heatmap and highlight may not happen at all without the training reviews, which indicates the difficulty of interpreting these feature-based explanations and the importance of explaining the explanations. One possible direction is to develop automatic tutorials to teach the intuitions behind important features, which is related to machine teaching [51, 68, 82].

Showing predicted labels significantly improves human performance (Figures 3a and 3b). As Figure 3a shows, showing predicted labels drastically improves human performance (61.9% for predicted label w/o accuracy, a 21% relative improvement over control; the difference with heatmap is statistically significant (p < 0.001)).

By presenting machine accuracy as shown in Figure 2b, the performance is further improved to 74.6% (predicted label w/ accuracy in Figure 3a, a 46% relative improvement over control).
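The relative improvements quoted here are measured against the control accuracy, e.g., (61.9 - 51.1) / 51.1 ≈ 0.21 and (74.6 - 51.1) / 51.1 ≈ 0.46, and the reported p-values come from t-tests between setups. A sketch is below (assuming SciPy; whether the authors compared per-participant or per-prediction accuracies, and which t-test variant they used, is not stated here, so treat this as an approximation).

from scipy.stats import ttest_ind

def relative_improvement(acc, baseline):
    # e.g., relative_improvement(61.9, 51.1) ~= 0.21 and
    #       relative_improvement(74.6, 51.1) ~= 0.46
    return (acc - baseline) / baseline

def compare_setups(accs_a, accs_b):
    # Two-sample t-test between two lists of per-participant accuracies
    # (an assumed unit of analysis); returns the t statistic and p-value.
    return ttest_ind(accs_a, accs_b)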

These results are consistent with Hypothesis 2. The big performance gap between showing predicted labels and showing feature (example)-based explanations alone suggests that when humans interact with machine learning models, it makes a significant difference whether predicted labels are shown. However, this observation also echoes concerns about humans overly relying on machines [44].

To further understand human performance with predicted labels, we examine all experimental setups with predicted labels in Figure 3b. Although showing predicted labels seems necessary for achieving a sizable improvement in human performance, the effect of presenting machine accuracy can be moderated by showing feature (example)-based explanations. We find that predicted label + examples and predicted label + heatmap outperform predicted label w/o accuracy (69.7% and 72.5% vs. 61.9%), without presenting the machine accuracy. In this case, we observe that heatmap is more effective than examples and leads to human performance comparable to predicted label w/ accuracy. There is still a gap between the best human performance (predicted label w/ accuracy) and machine performance (74.6% vs. 87.0%). These observations suggest that humans do not necessarily trust machine predictions.

4.2 Trust
We further examine the levels of trust that humans place on machine predictions when predicted labels are available. Since machine performance surpasses human performance in control by a wide margin in this task, higher levels of trust are correlated with higher levels of accuracy in our experiments. However, these two metrics capture different dimensions of human predictions because trust is tied to machine predictions. This becomes clear when we break down human trust by whether machine predictions are correct or not. We find that humans tend to trust correct machine predictions more than incorrect ones, which suggests that humans can somewhat effectively identify cases where machines are wrong. It is important to emphasize that our focus is on understanding how human trust varies along the spectrum rather than on manipulating the trust of humans in machines.

Figure 5: Heterogeneity findings among participants in our study. Figure 5a shows performance estimation by participants in three different experimental setups. Figure 5b presents the performance of participants in the predicted label + heatmap group by two variables, hint usefulness and gender.

Feature (example)-based explanations increase the trust that humans place on machine predictions (Figure 4a). We further introduce random heatmap by randomly highlighting an equal number of words as in heatmap, to examine whether humans are influenced by any explanations, including random ones.

Our results are consistent with Hypothesis 3: both feature-based and example-based explanations increase the trust of humans in machine predictions. In fact, predicted label + heatmap leads to a similar level of trust as predicted label w/ accuracy, although the latter explicitly tells humans that the machine learning model "has an accuracy of approximately 87%". In other words, when predicted labels are shown, heatmap can nudge humans in decision making without making strong statements of machine accuracy. Interestingly, random heatmap also increases the trust level significantly, suggesting that even irrelevant details can increase the trust of humans in machine predictions. The fact that heatmap is significantly more effective than random heatmap (78.7% vs. 73.4%, p < 0.001) indicates that humans can interpret valuable information in weight coefficients beyond "the placebo effect".

Humans tend to trust machine predictions more when machine predictions are correct (Figure 4b). We next examine whether humans trust machine predictions more when machine predictions are correct than when they are incorrect. Figure 4b shows that in all five experimental setups with predicted labels, our participants trust correct machine predictions more than incorrect ones. However, the difference is statistically significant only in predicted label w/ accuracy (p < 0.001) and predicted label w/ heatmap (random) (p = 0.015). These results suggest that humans can somewhat differentiate correct machine predictions from incorrect ones. Further evidence is required to fully understand why humans (don't) trust (in)correct machine predictions. Such an understanding can improve both machine learning models and their presentations to support human decision making.
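The random heatmap control could be generated with a sketch like the following (an assumption about the procedure: it highlights the same number of in-vocabulary words as the weight-based heatmap but chooses them uniformly at random; how the shades were assigned is not specified, so random intensities are used here purely as a placeholder).

import random

def random_heatmap_words(review, vectorizer, k=10, seed=None):
    # Pick k in-vocabulary words uniformly at random and give them arbitrary
    # shading intensities; carries no information from the model's weights.
    rng = random.Random(seed)
    tokens = [w for w in set(vectorizer.build_analyzer()(review))
              if w in vectorizer.vocabulary_]
    chosen = rng.sample(tokens, min(k, len(tokens)))
    return [(word, rng.random()) for word in chosen]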

4.3 Heterogeneity in Human Perception and Performance
We finally discuss the heterogeneity between participants in our study. Here we focus on the participants' estimation of their own performance and gender differences. Refer to the appendix for additional comparisons.

Human estimation of their own performance (Figure 5a). We ask participants to estimate their own performance in our exit survey. Our results are not exactly aligned with the previous finding that humans tend to overestimate their capacity of detecting lying [20]. In fact, ∼42% of the participants correctly predicted their performance. Among the remaining, ∼18% overestimated their performance, while ∼40% underestimated their performance. Figure 5a shows the breakdown for three experimental setups. In general, it seems difficult for humans to estimate their performance. One participant who overestimated his performance (estimated 11-15 but got 10 correct) said, "I enjoyed this hit. When I was a young man, I was a manager in the hotel business and got to read a lot of comment cards from guests. I hope that I was pretty accurate in my answers". Another participant who underestimated his performance (estimated 6-10 but got 15 correct) said, "It was difficult to determine if they were genuine or deceptive. I don't feel certain on any of my choices".

Heterogeneity in performance across individuals (Figure 5b). We have so far focused on average human performance comparisons between different experimental setups. It is important to recognize that the performance of individuals can vary. Exit survey responses allow us to study such heterogeneity. We focus on two properties in the interest of space. Refer to the appendix for a complete discussion of heterogeneity between individuals.

First, individuals who find the hints useful outperform those who find the hints not useful. The difference between these two groups in Figure 5b (predicted label + heatmap) is statistically significant. This observation resonates with our analysis regarding the trust of humans in machine predictions and holds in 5 out of 8 experimental setups (this question was not asked in control), although the differences are only statistically significant in three setups.2 Second, we find that females generally outperform males. This observation holds in 8 out of 9 experimental setups, but none of the differences is statistically significant. Our results contribute to mixed observations regarding gender differences in deception detection [14, 46, 52, 53].

2 The low number of statistically significant differences is expected, because human performance is low unless we show predicted labels.


Figure 6: Human accuracy (a) and trust (b) given varying statements of machine accuracy. Figures 6a and 6b show that human accuracy and trust generally decline with statements of decreasing machine accuracy, despite the fact that machine predictions remain unchanged. Note that the decline of human trust with statements of decreasing accuracy is small. Only by adding frequency explanations do human accuracy and trust become closer to not showing any indication of machine accuracy, i.e., predicted label w/o accuracy.

5 VARYING STATEMENTS OF MACHINE ACCURACY

models in decision making before using these machine learningmodels in the loop of human decision making.Frequency explanations can help humans interpret and acton statements of machine accuracy. To further investigate hu-man interaction with varying statements of machine accuracy, weadd frequency explanations to the statement with accuracy 50% and60%. Specifically, we show participants “The machine predicts thatthe below review is deceptive. It has an accuracy of approximately50%, which means that it is correct 5 out of 10 times.” instead of“The machine predicts that the below review is deceptive. It has anaccuracy of approximately 50%.” The results are shown with thered bars filled with stars in Figure 6a and Figure 6b. We find thatfrequency explanations reduce the trust that humans place on ma-chine predictions. For instance, human accuracy in predicted label w/accuracy (50%) + frequency explanation is ∼7% lower (p=0.003) thanin predicted label w/ accuracy (50%). Similarly, human trust in pre-dicted label w/ accuracy (50%) + frequency explanation is ∼10% lower(p<0.001) than in predicted label w/ accuracy (50%). Furthermore,the differences in human accuracy and trust are not statisticallysignificant between predicted label w/ accuracy (50%) + frequencyexplanation and predicted label w/o accuracy. These observationssuggest that frequency explanations can help humans interpretstatements of machine accuracy, in which case a statement of 50%accuracy with frequency explanation is almost the same as notshowing machine accuracy. Our frequency explanations are alsoknown as frequent format and have been shown to bemore effectivefor conveying uncertainty than stating the probability [25, 26, 66].

6 CONCLUDING DISCUSSIONIn this paper, we conduct the first empirical study to investigatewhether machine predictions and their explanations can improvehuman performance in challenging tasks such as deception detec-tion. We propose a spectrum between full human agency and fullautomation, and design machine assistance with varying levelsof priming along the spectrum. We find that explanations aloneslighlty improve human performance, while showing predicted la-bels significantly improves human performance. Adding an explicitstatement of strong machine performance can further improve

Page 9: On Human Predictions with Explanations and Predictions of ...

On Human Predictions with Explanations and Predictions of Machine Learning Models FAT* ’19, January 29–31, 2019, Atlanta, GA, USA

human performance. Our results demonstrate a tradeoff betweenhuman performance and human agency, and explaining machinepredictions may moderate this tradeoff.

We find interesting results regarding the trust that humans place on machine predictions. On the one hand, humans tend to trust correct machine predictions more than incorrect ones, which indicates that it is possible to improve human decision making while retaining human agency. On the other hand, we show that human trust can be easily enhanced by adding random heatmaps as explanations or by statements of low accuracy that do not justify trusting machine predictions. In other words, additional details, including irrelevant ones, can improve the trust that humans place on machine predictions. These findings highlight the importance of taking caution in using machine learning to support decision making and of developing methods to improve the transparency of machine learning models and the associated human interpretation.

As machine learning is employed to support decision making in our society, it is crucial that the machine learning community not only advances machine learning models, but also develops a better understanding of how these machine learning models are used and how humans interact with these models in the process of decision making. Our study takes an initial step towards understanding human predictions with assistance from machine learning models in challenging tasks.

Implications and future directions. Our results show that explanations alone slightly improve human performance. One reason for the limited improvement with explanations alone is that although we provide explanations during the decision making process, we provide limited resources to "teach" these explanations. A possible future direction is to develop tutorials for machine learning models and their explanations to relieve some cognitive burden from humans, e.g., summarizing the model as a list of rules, adding heatmaps to examples, or providing a sequence of training examples with explanations and sufficient coverage. This direction also connects to the area of machine teaching [51, 68, 82].

Another possible direction to improve the effectiveness of explanations is to provide narratives. Our results suggest that feature-based and example-based explanations provide useful details for machine predictions to improve the trust of humans in machine predictions. It can be useful if we can similarly provide rationales behind feature-based and example-based explanations in the form of narratives. A qualitative understanding of how turkers interpret hints from machine learning models may shed light on the requirements of effective narratives.

Last but not least, it is important to study the ethical concerns of providing assistance from machine learning models in human decision making. Our results demonstrate a clear tradeoff in this space: it is difficult to improve human performance without showing predicted labels, but showing predicted labels, especially alongside machine performance, runs the risk of removing human agency. Human decision makers with assistance from machines further complicate the current discussions on the issue of fairness in algorithmic decision making [12, 28, 38]. As the adoption of machine learning approaches can have broad impacts on our society, such questions require inputs from machine learning researchers, legal scholars, and the entire society.

Limitations. We use Amazon Mechanical Turk to recruit participants, but this may not be a representative sample of the population. However, we would like to emphasize that turkers are likely to provide a better proxy than machine learning experts for understanding how humans interact with assistance from machine learning models in critical challenging tasks. Also, our explanations are derived from a linear SVM classifier and nearest neighbors. It may be even more challenging for humans to interpret explanations of non-linear classifiers.

Another important challenge in understanding how humans interact with machine learning models lies in the difficulty of assessing the generalizability of our results. Our formulation of deception detection represents a scenario where machines outperform humans by a wide margin and humans may have developed false beliefs about this task, as most humans have read reviews online. In order to consider a wide range of tasks, e.g., bail decisions and medical diagnosis, we need a framework to compare different tasks. Machine performance and humans' prior intuition are probably important factors that can influence human interpretation of the explanations. However, it remains an open question whether there exists a principled framework to reason about these tasks. At the very least, it is important for our community to go beyond simple visual tasks such as OCR and object recognition, especially for the purpose of improving human performance in decision making.

REFERENCES
[1] Mohamed Abouelenien, Veronica Pérez-Rosas, Rada Mihalcea, and Mihai Burzo. 2014. Deception detection using a multimodal approach. In Proceedings of ICMI.
[2] Leman Akoglu, Rishi Chandy, and Christos Faloutsos. 2013. Opinion Fraud Detection in Online Reviews by Network Effects. In Proceedings of ICWSM.
[3] Hunt Allcott and Matthew Gentzkow. 2017. Social media and fake news in the 2016 election. Journal of Economic Perspectives 31, 2 (2017), 211–236.
[4] W Ross Ashby. 1957. An introduction to cybernetics. (1957).
[5] James O Berger. 2013. Statistical decision theory and Bayesian analysis. Springer Science & Business Media.
[6] Donald M Berwick, Harvey V Fineberg, and Milton C Weinstein. 1981. When doctors meet numbers. The American Journal of Medicine 71, 6 (1981), 991–998.
[7] Charles F Bond Jr and Bella M DePaulo. 2006. Accuracy of deception judgments. Personality and Social Psychology Review 10, 3 (2006), 214–234.
[8] Adrian Bussone, Simone Stumpf, and Dympna O'Sullivan. 2015. The role of explanations on trust and reliance in clinical decision support systems. In Healthcare Informatics (ICHI), 2015 International Conference on. IEEE, 160–169.
[9] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proceedings of KDD.
[10] Arjun Chandrasekaran, Viraj Prabhu, Deshraj Yadav, Prithvijit Chattopadhyay, and Devi Parikh. 2018. Do explanations make VQA models more predictable to a human? In Proceedings of EMNLP.
[11] Judith A Chevalier and Dina Mayzlin. 2006. The effect of word of mouth on sales: Online book reviews. Journal of Marketing Research 43, 3 (2006), 345–354.
[12] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. 2017. Algorithmic decision making and the cost of fairness. In Proceedings of KDD.
[13] Dan Cosley, Shyong K Lam, Istvan Albert, Joseph A Konstan, and John Riedl. 2003. Is seeing believing? How recommender system interfaces affect users' opinions. In Proceedings of CHI.
[14] Bella M DePaulo, Jennifer A Epstein, and Melissa M Wyer. 1993. Sex differences in lying: How women and men deal with the dilemma of deceit. (1993).
[15] Berkeley J Dietvorst, Joseph P Simmons, and Cade Massey. 2015. Algorithm aversion: People erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General 144, 1 (2015), 114.
[16] Harris Drucker, Donghui Wu, and Vladimir N Vapnik. 1999. Support vector machines for spam categorization. IEEE Transactions on Neural Networks 10, 5 (1999), 1048–1054.
[17] Earl F Dulaney. 1982. Changes in language behavior as a function of veracity. Human Communication Research 9, 1 (1982), 75–82.
[18] Mary T Dzindolet, Scott A Peterson, Regina A Pomranky, Linda G Pierce, and Hall P Beck. 2003. The role of trust in automation reliance. International Journal of Human-Computer Studies 58, 6 (2003), 697–718.


[19] Paul Ekman, Wallace V Freisen, and Sonia Ancoli. 1980. Facial signs of emotional experience. Journal of Personality and Social Psychology 39, 6 (1980), 1125.
[20] Eitan Elaad. 2003. Effects of feedback on the overestimated capacity to detect lies and the underestimated ability to tell lies. Applied Cognitive Psychology 17, 3 (2003), 349–363.
[21] Diane Farsetta and Daniel Price. 2006. Fake TV news: Widespread and undisclosed. Center for Media and Democracy 6 (2006).
[22] Song Feng, Ritwik Banerjee, and Yejin Choi. 2012. Syntactic stylometry for deception detection. In Proceedings of ACL (short papers).
[23] Shi Feng and Jordan Boyd-Graber. 2018. What can AI do for me: Evaluating Machine Learning Interpretations in Cooperative Play. arXiv preprint arXiv:1810.09648 (2018).
[24] Vanessa Wei Feng and Graeme Hirst. 2013. Detecting deceptive opinions with profile compatibility. In Proceedings of IJCNLP.
[25] Gerd Gigerenzer. 1996. The psychology of good judgment: frequency formats and simple algorithms. Medical Decision Making 16, 3 (1996), 273–280.
[26] Gerd Gigerenzer and Ulrich Hoffrage. 1995. How to improve Bayesian reasoning without instruction: frequency formats. Psychological Review 102, 4 (1995), 684.
[27] Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. 2004. Combating web spam with TrustRank. In Proceedings of VLDB.
[28] Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. In Proceedings of NIPS.
[29] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of ICCV.
[30] Jonathan L Herlocker, Joseph A Konstan, and John Riedl. 2000. Explaining collaborative filtering recommendations. In Proceedings of CSCW.
[31] Eric Horvitz. 1999. Principles of mixed-initiative user interfaces. In Proceedings of CHI.
[32] Nitin Jindal and Bing Liu. 2008. Opinion spam and analysis. In Proceedings of WSDM.
[33] Peter GW Keen. 1978. Decision support systems: an organizational perspective. Technical Report.
[34] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. 2016. Examples are not enough, learn to criticize! Criticism for interpretability. In Proceedings of NIPS.
[35] Been Kim, Cynthia Rudin, and Julie A Shah. 2014. The Bayesian case model: A generative approach for case-based reasoning and prototype classification. In Proceedings of NIPS.
[36] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. 2017. Human decisions and machine predictions. The Quarterly Journal of Economics 133, 1 (2017), 237–293.
[37] Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ziad Obermeyer. 2015. Prediction policy problems. American Economic Review 105, 5 (2015), 491–95.
[38] Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan. 2017. Inherent trade-offs in the fair determination of risk scores. In Proceedings of ITCS.
[39] Mark L Knapp, Roderick P Hart, and Harry S Dennis. 1974. An exploration of deception as a communication construct. Human Communication Research 1, 1 (1974), 15–29.
[40] Ronald T Kneusel and Michael C Mozer. 2017. Improving Human-Machine Cooperative Visual Search With Soft Highlighting. ACM Transactions on Applied Perception (TAP) 15, 1 (2017), 3.
[41] Robert M Krauss, Valerie Geller, and Christopher Olson. 1976. Modalities and cues in the detection of deception. In Meeting of the American Psychological Association, Washington, DC.
[42] Mark L Knapp and Mark E Comaden. 1979. Telling it like it isn't: A review of theory and research on deceptive communications. Human Communication Research 5, 3 (1979), 270–285.
[43] David MJ Lazer, Matthew A Baum, Yochai Benkler, Adam J Berinsky, Kelly M Greenhill, Filippo Menczer, Miriam J Metzger, Brendan Nyhan, Gordon Pennycook, David Rothschild, Michael Schudson, Steven A. Sloman, Cass R. Sunstein, Emily A. Thorson, Duncan J. Watts, and Jonathan L. Zittrain. 2018. The science of fake news. Science 359, 6380 (2018), 1094–1096.
[44] John D Lee and Katrina A See. 2004. Trust in automation: Designing for appropriate reliance. Human Factors 46, 1 (2004), 50–80.
[45] Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of EMNLP.
[46] Li Li. 2011. Sex differences in deception detection.
[47] Adam Liptak. 2017. Sent to Prison by a Software Program's Secret Algorithms.
[48] Tania Lombrozo. 2006. The structure and function of explanations. Trends in Cognitive Sciences 10, 10 (2006), 464–470.
[49] Tania Lombrozo. 2007. Simplicity and probability in causal explanation. Cognitive Psychology 55, 3 (2007), 232–257.
[50] Scott M Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. In Proceedings of NIPS.
[51] Oisin Mac Aodha, Shihan Su, Yuxin Chen, Pietro Perona, and Yisong Yue. 2018. Teaching categories to human learners with visual explanations. In Proceedings of CVPR.
[52] Samantha Mann, Aldert Vrij, and Ray Bull. 2004. Detecting true lies: police officers' ability to detect suspects' lies. Journal of Applied Psychology 89, 1 (2004), 137.
[53] Steven A McCornack and Malcolm R Parks. 1990. What women know that men don't: Sex differences in determining the truth behind deceptive messages. Journal of Social and Personal Relationships 7, 1 (1990), 107–118.
[54] Albert Mehrabian. 1971. Silent messages. Vol. 8. Wadsworth Belmont, CA.
[55] Allen Newell and Herbert Alexander Simon. 1972. Human problem solving. Vol. 104. Prentice-Hall Englewood Cliffs, NJ.
[56] Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web pages through content analysis. In Proceedings of WWW.
[57] Brendan Nyhan and Jason Reifler. 2010. When corrections fail: The persistence of political misperceptions. Political Behavior 32, 2 (2010), 303–330.
[58] Myle Ott, Claire Cardie, and Jeff Hancock. 2012. Estimating the prevalence of deception in online review communities. In Proceedings of WWW.
[59] Myle Ott, Claire Cardie, and Jeffrey T Hancock. 2013. Negative deceptive opinion spam. In Proceedings of NAACL.
[60] Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of ACL.
[61] Raja Parasuraman and Victor Riley. 1997. Humans and automation: Use, misuse, disuse, abuse. Human Factors 39, 2 (1997), 230–253.
[62] Ellen Peters, Daniel Västfjäll, Paul Slovic, CK Mertz, Ketti Mazzocco, and Stephan Dickert. 2006. Numeracy and decision making. Psychological Science 17, 5 (2006), 407–413.
[63] Valerie F Reyna and Charles J Brainerd. 2008. Numeracy, ratio bias, and denominator neglect in judgments of risk and probability. Learning and Individual Differences 18, 1 (2008), 89–107.
[64] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of KDD.
[65] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Anchors: High-Precision Model-Agnostic Explanations. In Proceedings of AAAI.
[66] Peter Sedlmeier and Gerd Gigerenzer. 2001. Teaching Bayesian reasoning in less than two hours. Journal of Experimental Psychology: General 130, 3 (2001), 380.
[67] Jung P Shim, Merrill Warkentin, James F Courtney, Daniel J Power, Ramesh Sharda, and Christer Carlsson. 2002. Past, present, and future of decision support technology. Decision Support Systems 33, 2 (2002), 111–126.
[68] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. 2014. Near-Optimally Teaching the Crowd to Classify. In Proceedings of ICML.
[69] Paul Slovic and Ellen Peters. 2006. Risk perception and affect. Current Directions in Psychological Science 15, 6 (2006), 322–325.
[70] Supreme Court of the United States. 1993. Daubert v. Merrell Dow Pharmaceuticals, Inc. 509 U.S. 579.
[71] Supreme Court of Wisconsin. 2016. State of Wisconsin, Plaintiff-Respondent, v. Eric L. Loomis, Defendant-Appellant.
[72] Michael Trusov, Randolph E Bucklin, and Koen Pauwels. 2009. Effects of word-of-mouth versus traditional marketing: findings from an internet social networking site. Journal of Marketing 73, 5 (2009), 90–102.
[73] Kush R Varshney. 2016. Engineering safety in machine learning. In Information Theory and Applications Workshop (ITA), 2016.
[74] Soroush Vosoughi, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online. Science 359, 6380 (2018), 1146–1151.
[75] Aldert Vrij. 2000. Detecting lies and deceit: The psychology of lying and implications for professional practice. Wiley.
[76] Christopher D Wickens, Justin G Hollands, Simon Banbury, and Raja Parasuraman. 2015. Engineering psychology & human performance. Psychology Press.
[77] Guangyu Wu, Derek Greene, Barry Smyth, and Pádraig Cunningham. 2010. Distortion as a validation criterion in the identification of suspicious reviews. In Proceedings of the First Workshop on Social Media Analytics.
[78] Qiang Ye, Rob Law, Bin Gu, and Wei Chen. 2011. The influence of user-generated content on traveler behavior: An empirical investigation on the effects of e-word-of-mouth to hotel online bookings. Computers in Human Behavior 27, 2 (2011), 634–639.
[79] Kyung-Hyan Yoo and Ulrike Gretzel. 2009. Comparison of deceptive and truthful travel reviews. Information and Communication Technologies in Tourism 2009 (2009), 37–47.
[80] Amy X Zhang, Aditya Ranganathan, Sarah Emlen Metz, Scott Appling, Connie Moon Sehat, Norman Gilmore, Nick B Adams, Emmanuel Vincent, Martin Robbins, Ed Bice, Sandro Hawke, David Karger, and An Xiao Mina. 2018. A Structured Response to Misinformation: Defining and Annotating Credibility Indicators in News Articles. In Proceedings of WWW (Companion).
[81] Ziqiong Zhang, Qiang Ye, Rob Law, and Yijun Li. 2010. The impact of e-word-of-mouth on the online popularity of restaurants: A comparison of consumer reviews and editor reviews. International Journal of Hospitality Management 29, 4 (2010), 694–700.
[82] Xiaojin Zhu. 2015. Machine Teaching: An Inverse Problem to Machine Learning and an Approach Toward Optimal Education. In Proceedings of AAAI.


ACKNOWLEDGMENTS
We would like to thank Elizabeth Bradley, Michael Mozer, Sendhil Mullainathan, Amit Sharma, Adith Swaminathan, and anonymous reviewers for helpful discussions and feedback. This material is based upon work supported by the National Science Foundation under Grant No. 1837986. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

A APPENDIX

A.1 Amazon Mechanical Turk Setup
To ensure quality results, we include several criteria for turkers: 1) the turker is based in the United States so that we can assume English fluency; 2) the turker has completed at least 50 HITs (human intelligence tasks); 3) the turker has an approval rate of at least 99%.
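For readers who want to apply the same screening, these criteria map onto MTurk qualification requirements. The sketch below uses boto3; the client setup and variable names are placeholders, and the QualificationTypeIds shown are the standard system qualifications (Locale, NumberHITsApproved, PercentAssignmentsApproved) from Amazon's documentation, not values taken from the paper.

```python
# Hypothetical sketch: expressing the three worker criteria above as MTurk
# qualification requirements. Only the requirement structure is shown; the
# rest of the HIT configuration is omitted.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")  # placeholder region

qualification_requirements = [
    {   # 1) based in the United States
        "QualificationTypeId": "00000000000000000071",  # Worker_Locale
        "Comparator": "EqualTo",
        "LocaleValues": [{"Country": "US"}],
    },
    {   # 2) at least 50 approved HITs
        "QualificationTypeId": "00000000000000000040",  # Worker_NumberHITsApproved
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [50],
    },
    {   # 3) approval rate of at least 99%
        "QualificationTypeId": "000000000000000000L0",  # Worker_PercentAssignmentsApproved
        "Comparator": "GreaterThanOrEqualTo",
        "IntegerValues": [99],
    },
]
```

These requirements would then be passed via the QualificationRequirements parameter when creating the HIT.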

Before working on the main task, turkers go through a short training session, in which we show three reviews from the training data and present the correct answer after turkers make their prediction. The interface during training is exactly the same as in the actual experiment. After making predictions for 20 reviews, turkers are required to fill out an exit survey that solicits their estimation of their own performance in this task and basic demographic information, including age, gender, education background, and experience with online reviews (screenshots in Figure 15 and Figure 16). If the HIT is approved, the turker is compensated one dollar plus a bonus that depends on the number of reviews they correctly predicted; for example, a turker who makes 11 correct predictions receives $0.22 in addition to the base dollar. The average duration for finishing our HIT is about 11 minutes (Figure 7 shows the CDF of the duration). Turkers spend the shortest amount of time on average (8.3 minutes) in predicted labels w/ accuracy and the longest amount of time on average (14.4 minutes) in examples, which is consistent with our expectation about the extra cognitive burden of reading two more reviews. To sanity check that participants pay similar attention throughout the study, Figure 8 shows the average accuracy with respect to the order in which reviews show up³: there is no downward trend. All results are based on the 9 experimental setups in Section 4 of the main paper; results with varying statements of accuracy are not included.
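Since the bonus schedule is only illustrated by a single example above, the following tiny sketch assumes a linear bonus of $0.02 per correct prediction, which is consistent with the 11-correct, $0.22 example; the function name and constants are ours, not the authors'.

```python
# Sketch of the payment scheme described above, under the assumption that the
# bonus is linear in the number of correct predictions.
BASE_PAY = 1.00
ASSUMED_BONUS_PER_CORRECT = 0.02  # inferred from the $0.22-for-11-correct example

def total_compensation(num_correct: int) -> float:
    """Base pay plus per-correct bonus for one approved HIT of 20 reviews."""
    return round(BASE_PAY + ASSUMED_BONUS_PER_CORRECT * num_correct, 2)

assert abs(total_compensation(11) - 1.22) < 1e-9  # matches the worked example
```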

A.2 Experiment Interfaces
This section shows example interfaces for the other five experimental setups that are not shown in the main paper (predicted label + heatmap (random) has the same interface as predicted label + heatmap except that words are highlighted randomly).

• Control (Figure 17a).
• Highlight (Figure 17b).
• Examples (Figure 18a).
• Predicted label w/o accuracy (Figure 18b).
• Predicted label + examples (Figure 19).

³ Thanks to suggestions from anonymous reviewers.

Figure 7: Cumulative distribution of study duration in 9 experimental setups (x-axis: duration in minutes; y-axis: percentage of workers).

Figure 8: Average accuracy (%) with respect to review ordering (review number 1–20) in 9 experimental setups.

A.3 Individual Differences
Here we present further results on heterogeneous performance among individuals. We present figures for four experimental setups that are representative of different levels of priming: heatmap, examples, predicted label w/o accuracy, and predicted label + heatmap.

Hint usefulness (Figure 9). As discussed in the main paper, human performance is better for participants who find hints useful than for those who do not in 5 out of 8 experimental setups; highlight, heatmap, and predicted label w/o accuracy are the exceptions. The difference is statistically significant in three setups (predicted label + heatmap, predicted label + heatmap (random), predicted label w/ accuracy).
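The significance statements in this appendix compare accuracies between groups of participants. The exact test used is not restated in this section, so the snippet below only illustrates one reasonable choice (a Welch t-test on per-participant accuracies via scipy); the accuracy arrays are made up for demonstration and are not the study's data.

```python
# Illustrative group comparison of per-participant accuracies (fraction of
# 20 reviews correct). Data and test choice are assumptions for demonstration.
import numpy as np
from scipy import stats

acc_useful = np.array([0.80, 0.75, 0.70, 0.85, 0.65])      # hypothetical
acc_not_useful = np.array([0.60, 0.70, 0.55, 0.65, 0.75])  # hypothetical

t_stat, p_value = stats.ttest_ind(acc_useful, acc_not_useful, equal_var=False)
print(f"Welch t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
```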

Figure 9: Human accuracy (%) vs. usefulness of hints, for heatmap, examples, predicted label w/o accuracy, and predicted label + heatmap.


Gender differences (Figure 10). Females generally outperform males, in 8 out of 9 experimental setups. None of the differences is statistically significant.

Figure 10: Human accuracy (%) vs. gender.

Review sentiments (Figure 11). One possible hypothesis is that humans perform differently depending on the sentiment of reviews. Indeed, we observe that humans consistently perform better on positive reviews (8 out of 9 experimental setups). However, the difference is only statistically significant for predicted label w/o accuracy.

Figure 11: Human accuracy (%) vs. review sentiment.

Education background (Figure 12). There is no clear trend regarding education background, which suggests that education levels do not correlate with the ability to detect deception. For instance, high school graduates perform the best in predicted label w/o accuracy, but the worst in examples. Since there are five groups, each group is relatively sparse; we thus did not conduct statistical testing for these observations.

Age group (Figure 13). There is no clear trend regarding age groups either. For instance, participants who are 61 & above perform the best in predicted label w/o accuracy, but the worst in predicted label + heatmap. Similarly, since there are five relatively sparse groups, we did not conduct statistical testing for these observations.

Review experience (Figure 14). There is no clear trend regarding experience of writing reviews. With the exception of control and predicted label + heatmap (random), the group that reports the best performance is either users who write reviews weekly or users who write reviews frequently. Again, we did not conduct statistical testing for review experience.

Figure 12: Human accuracy (%) vs. education background.

Figure 13: Human accuracy (%) vs. age group.

Figure 14: Human accuracy (%) vs. review writing experience.


Figure 15: Survey questions for control group.


Figure 16: Survey questions for all the other groups.


Figure 17: Example interfaces for (a) control and (b) highlight.


Figure 18: Example interfaces for (a) examples and (b) predicted label w/o accuracy.


Figure 19: Example interface for predicted label + examples.

