Economic Theory and Statistical Learning
Economic Theory and Statistical Learning
A dissertation presented
by
Annie Liang
to
The Department of Economics
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the subject of
Economics
Harvard University
Cambridge, Massachusetts
May 2016
© 2016 Annie Liang
All rights reserved.
Dissertation Advisors: Professor Drew Fudenberg and Professor Jerry Green
Author: Annie Liang
Economic Theory and Statistical Learning
Abstract
This dissertation presents three independent essays in microeconomic theory. Chapter 1 suggests an alternative to the common prior assumption, in which agents form beliefs by learning from data, possibly interpreting the data in different ways. In the limit as agents observe increasing quantities of data, the model returns strict solutions of a limiting complete information game, but predictions may diverge substantially for small quantities of data. Chapter 2 (with Jon Kleinberg and Sendhil Mullainathan) proposes the use of machine learning algorithms to construct benchmarks for “achievable” predictive accuracy. The paper illustrates this approach for the problem of predicting human-generated random sequences. We find that leading models explain approximately 10-15% of the predictable variation in the problem. Chapter 3 considers the problem of how to interpret inconsistent choice data, when the observed departures from the standard model (perfect maximization of a single preference) may emerge either from context-dependencies in preference or from stochastic choice error. I show that if preferences are “simple” in the sense that they consist only of a small number of context-dependencies, then the analyst can use a proposed optimization problem to recover the true number of underlying context-dependent preferences.
Contents
Abstract
Acknowledgments

Introduction

1 Games of Incomplete Information Played by Statisticians
  1.1 Introduction
  1.2 Example
  1.3 Preliminaries and Notation
    1.3.1 The game
    1.3.2 Beliefs
    1.3.3 Solution concepts
  1.4 Learning from Data
    1.4.1 When do agents commonly learn?
  1.5 Robustness to Inference
    1.5.1 Concepts
    1.5.2 Bayesian Nash Equilibrium
    1.5.3 Rationalizable Actions
  1.6 How Much Data do Agents Need?
    1.6.1 Bayesian Nash Equilibrium
    1.6.2 Rationalizable Actions
    1.6.3 Diversity across Inference Rules in M
  1.7 Extensions
    1.7.1 Misspecification
    1.7.2 Private Data
    1.7.3 Limit Uncertainty
  1.8 Related Literature
    1.8.1 Robustness of equilibrium and equilibrium refinements
    1.8.2 Role of higher-order beliefs
    1.8.3 Agents who learn from data
    1.8.4 Model uncertainty
    1.8.5 Epistemic game theory
  1.9 Discussion

2 The Theory is Predictive, but is it Complete? An Application to Human Perception of Randomness
  2.1 Introduction
  2.2 Primary setting: human generation of coin flips
    2.2.1 Description of data
    2.2.2 Theories of misperception
    2.2.3 Prediction tasks
    2.2.4 Establishing a benchmark
    2.2.5 Other possible benchmarks
  2.3 Transfer across Domains
    2.3.1 Prediction of New Alphabets
    2.3.2 Prediction of Subsequent Flips
  2.4 Discussion
    2.4.1 Guarantees on the benchmark
    2.4.2 Covariates
    2.4.3 Transfer learning
  2.5 Relationship to Literature
  2.6 Conclusion

3 Interpretation of Inconsistent Choice Data: How Many Context-Dependent Preferences are There?
  3.1 Introduction
  3.2 Notation
  3.3 Example
  3.4 Approach
  3.5 Recovery Results
    3.5.1 Class of choice models
    3.5.2 Can we recover the number of orderings?
    3.5.3 Can we recover more?
  3.6 Relationship to Literature

References

Appendix A Appendix to Chapter 1
  A.1 Notation and Preliminaries
  A.2 Preliminary Results
  A.3 Appendix C: Main Results
    A.3.1 Proof of Claim 1
    A.3.2 Proof of Proposition 1
    A.3.3 Proof of Claim 3
    A.3.4 Proof of Theorem 2
    A.3.5 Proof of Proposition 2
    A.3.6 Proof of Proposition 3
    A.3.7 Proof of Proposition 4
  A.4 Appendix D: An example illustrating the fragility of weak strict-rationalizability

Appendix B Appendix to Chapter 2
  B.1 Experiment Instructions
  B.2 Behavioral Prediction Rules

Appendix C Appendix to Chapter 3
  C.1 Proof of Theorem 1
    C.1.1 Preliminary Notation and Results
    C.1.2 Main Proof
  C.2 Corollary 1
  C.3 Corollary 2
  C.4 Proof of Proposition 1
List of Tables
2.1 The empirical probability of Heads, conditional on three fixed previous flips: (1) the actual proportion of generated Heads in our data, (2) the assessed probability of Heads on the next flip from Rapoport & Budescu (1997), as presented in Rabin and Vayanos (2010), (3) probabilities consistent with a Bernoulli(0.5) process.
2.2 Prediction errors achieved using Rabin (2002) and Rabin and Vayanos (2010) are improvements on the prediction error achieved by guessing at random. How do we assess the size of this improvement?
2.3 Comparison of prediction error achieved using behavioral models with prediction error achieved using table lookup. The behavioral models explain between 9-12% of the explainable variation in the data.
2.4 Comparison of prediction error achieved using behavioral models with prediction error achieved using table lookup.
2.5 We train table lookup and our two behavioral models on the original 8-length {H, T} data, and then use the estimated models to predict 8-length {1, 2} and {@, !} data. Reported prediction errors are tenfold cross-validated mean squared errors.
2.6 We train table lookup and our two behavioral models on the original 8-length {H, T} data, and then use the estimated models to predict the data in {D_{k:k+7}}_{k=2}^{8}. Reported prediction errors are tenfold cross-validated mean squared errors.
List of Figures
1.1 Two relevant attributes (r = 2). Yield is high under environmental conditions in [−c′, c′]², and low otherwise. Farmers do not know the high yield region (shaded).
1.2 Circles indicate low yields, and squares indicate high yields. The two rectangles identify partitionings (predict high yield if x is contained within the rectangle, and low yield otherwise) that exactly fit the data.
1.3 Every axis-aligned rectangle partitioning predicts high yield at the origin, but there exists a rotated rectangle partitioning that predicts low yield.
1.4 The map h takes first-order beliefs μ into expected payoff functions.
1.5 The set U^R_{a*_i} is partitioned such that every agent’s set of rationalizable actions is constant across each element of the partition. There are three cases: (1) if u* is on the boundary of U^R_{a*_i} (e.g. u_1), then a*_i is not robust to inference; (2) if u* is in the interior of U^R_{a*_i}, and moreover in the interior of a partition element (u_2), then a*_i is certainly robust to inference; (3) if u* is in the interior of U^R_{a*_i}, but not in the interior of any partition element (u_3), then a*_i may not be robust to inference. See Appendix D for an example of the last case.
2.1 (a) Top row. Distribution of the number of heads in the realized string. Left: Comparison of MTurk data with theoretical Bernoulli predictions. Right: Comparison of Nickerson & Butler (2009) data with theoretical Bernoulli predictions. (b) Bottom row. Distribution of the proportion of runs of length m. Left: Comparison of MTurk data with theoretical Bernoulli predictions. Right: Comparison of Nickerson & Butler (2009) data with theoretical Bernoulli predictions.
3.1 The problem in (3.2) returns a solution with 2 orderings if and only if the line with normal vector (−1, −λ) supports F at (2, D_2).
3.2 Studying rationalizability of a dataset is equivalent to studying colorability of a graph in which nodes represent observations and edges represent inconsistencies.
C.1 Any choice of λ for which (−1, −λ) is a subgradient of f_H at (K, D_{K,H}) will recover K. With high probability, the set of vectors {(−1, −1/(p + d)) : d ∈ (0, d(1 − p)K/M − 2p − β)} is a subset of the subdifferential of f_H at (K, D_{K,H}).
Acknowledgments
I am deeply grateful to my advisors, Drew Fudenberg, Jerry Green, David Laibson,
and Sendhil Mullainathan. Sendhil Mullainathan’s support was essential in my
path to and through graduate school, and his unique insight is a singular inspiration.
David Laibson’s advising, along with his thoughtful example, has been the single
most important input into how I think about the purpose of my work, and what
I hope to contribute. Jerry Green guided me as a first-year graduate student,
helping me to discover the joys and challenges of research, and his generous
mentorship has been invaluable since then. Drew Fudenberg’s dedicated training,
mentorship, and guidance have been crucial to my development as a researcher,
and the development of my research identity.
In addition to my advisors, I am thankful also to Andrei Shleifer, Tomasz
Strzalecki, and Eric Maskin for many important conversations throughout graduate
school.
Finally, I am above all grateful to my parents Bo Liu and Zhi-Pei Liang, and to
my brother, Danny Liang. I love them dearly.
To my parents and brother
Introduction
This dissertation comprises three independent chapters that study topics at the
intersection of economic theory and statistical learning.
My job market paper (“Games with Incomplete Information Played by Statisticians”) develops a new framework for modeling beliefs in incomplete information games. The standard approach assumes that agents share a common prior distribution over uncertainty, and thus have a common (ex-ante) model of the world. It
is known that this assumption has several strong implications that conflict with
empirical evidence—in particular, it precludes public and persistent disagreement.
However, the question of what kind of heterogeneity in beliefs to allow, and how
to do so in a structured way, remains open. My paper proposes a reformulation
of incomplete information games, in which agents form beliefs by learning from
data. The key feature of this approach is that, in the absence of a “privileged” or “correct” model for interpreting data, the kind of disagreement that the standard model precludes arises naturally as a consequence of statistical ambiguity.
I use this framework to study the robustness of predictions (under the standard
model) to the introduction of statistical uncertainty, and additionally propose a
criterion for equilibrium selection that removes solutions that are only supported
by unreasonably large quantities of data.
The second segment (“The Theory is Predictive, but is it Complete? An Application to Human Perception of Randomness,” with Jon Kleinberg and Sendhil Mullainathan) develops a way to measure the “completeness” of an economic theory. Current methods for testing theories of economic behavior focus on whether
the predictions of the theory match what we see in the available data. But we also
care about the extent to which the predictable variation in data is captured by the
theory. This second property is difficult to measure, because in general we do not
know how much “predictable variation" there is in the problem. We propose the
use of machine learning algorithms to construct a benchmark for the “achievable
level of prediction," and illustrate this approach on the problem of predicting
human generation of random sequences. We find that leading behavioral models
explain approximately 10-15% of the variation in the data explained using an
atheoretical machine learning algorithm. This suggests that there is remaining
predictable structure in the problem to be uncovered.
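The benchmark idea is easy to prototype. The sketch below is my illustration, not the chapter's actual pipeline: synthetic over-alternating sequences stand in for the experimental data, a lookup table over the previous three flips plays the role of the atheoretical algorithm, and always predicting 0.5 plays the role of the naive baseline; all names and parameter values are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_flips(n_seq=2000, length=20, p_alt=0.6):
        # Synthetic stand-in for human-generated "random" flips:
        # subjects over-alternate, so P(next flip != last flip) > 0.5.
        s = np.zeros((n_seq, length), dtype=int)
        s[:, 0] = rng.integers(0, 2, n_seq)
        for t in range(1, length):
            alt = rng.random(n_seq) < p_alt
            s[:, t] = np.where(alt, 1 - s[:, t - 1], s[:, t - 1])
        return s

    def table_lookup_error(seqs, k=3, folds=10):
        # Tenfold cross-validated MSE of a lookup table that predicts the
        # next flip from the previous k flips (the "achievable" benchmark).
        X, y = [], []
        for s in seqs:
            for t in range(k, len(s)):
                X.append(int("".join(map(str, s[t - k:t])), 2))  # context id
                y.append(s[t])
        X, y = np.array(X), np.array(y)
        idx = rng.permutation(len(y))
        errs = []
        for f in range(folds):
            test = idx[f::folds]
            train = np.setdiff1d(idx, test)
            preds = np.full(2 ** k, 0.5)
            for c in range(2 ** k):
                mask = X[train] == c
                if mask.any():
                    preds[c] = y[train][mask].mean()
            errs.append(np.mean((preds[X[test]] - y[test]) ** 2))
        return float(np.mean(errs))

    err_naive = 0.25                                # MSE of always guessing 0.5
    err_table = table_lookup_error(simulate_flips())
    # A theory's completeness is the share of the naive-to-benchmark gap it
    # closes: (err_naive - err_theory) / (err_naive - err_table).
    print(err_naive, err_table)

Any behavioral model can then be slotted in as err_theory and read against the same yardstick.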
The final segment of my work involves the discovery of context-dependencies
in preference. In “Interpretation of Inconsistent Choice: How Many Context-
Dependent Preferences are There?", I consider the problem of how to interpret
inconsistent choice data, when the observed departures from the standard model
(perfect maximization of a single preference) may emerge either from context-
dependencies in preference or from stochastic choice error. I show that if preferences are “simple” in the sense that they consist only of a small number of
context-dependencies, then the analyst can use a proposed optimization problem
to recover the true number of underlying context-dependent preferences.
Chapter 1
Games of Incomplete Information
Played by Statisticians
1.1 Introduction
In games with a payoff-relevant parameter, players’ beliefs about this parameter,
as well as their beliefs about opponent beliefs about this parameter, are important
for predictions of play. The standard approach to restricting the space of beliefs
assumes that players share a common prior distribution.1 This assumption is
known to have strong implications, including that beliefs that are commonly
known must be identical (Aumann, 1976), and repeated communication of beliefs
will eventually lead to agreement (Geanakoplos and Polemarchakis, 1982). These
properties conflict not only with considerable empirical evidence of public and
persistent disagreement,2 but also with the more basic, day-to-day, experience that
people sometimes come to different conclusions given the same information.

1The related, stronger, notion of rational expectations assumes moreover that this common prior distribution is in fact the “true” distribution shared by the modeler.
2In financial markets, agents publicly disagree in their interpretations of earnings announcements (Kandel and Pearson, 1995), valuations of financial assets (Carlin et al., 2013), forecasts for inflation (Mankiw et al., 2004), forecasts for stock movements (Yu, 2011), and forecasts for mortgage loan prepayment speeds (Carlin et al., 2014). Agents publicly disagree also in matters of politics (Wiegel, 2009) and climate change (Marlon et al., 2013).
As a consequence, the following questions arise: When is disagreement a feature of agents’ beliefs, and how can this disagreement be predicted from the primitives
of the economic environment? Can we relax the common prior assumption to
accommodate (commonly known) disagreement in a structured way? Finally, when
are strategic predictions robust to relaxations of the common prior assumption?
Towards the first questions of modeling and predicting disagreement, I propose
a reformulation of incomplete information in which agents form beliefs by learning
from data. I take data to be a random sequence of observations, drawn i.i.d. from an
exogenous distribution P, and define an inference rule to be any map from possible
datasets into distributions over the parameter space. (For example, we can think of
data as historical stock prices, and inference rules as maps from possible time-series
of stock returns to distributions over returns next period.)
This perspective on beliefs provides a way to rationalize disagreement—in the absence of a “privileged” or “correct” inference rule, different interpretations of common data are not only possible, but natural.3 The key restriction I
impose to structure this approach is that while agents may learn from data using
different inference rules, they have common certainty in the predictions of a family
of plausible inference rules.4 This assumption is referred to as common inference.
In the main part of the paper, I additionally assume a condition on the family of
3Indeed, this perspective has been taken in work by Al-Najjar (2009), Gilboa et al. (2013), and Acemoglu et al. (2015), among others, in various nonstrategic settings (see Section 9.3 for an extended review). I embed these ideas into an incomplete information game, and study their implications for strategic behavior.
4Reflecting, for example, common cultural experiences or industry-specific norms.
inference rules (uniform consistency5) that implies that agents commonly learn the
true parameter (see Proposition 1).6 In this framework, complete information is
interpreted as a reduced form for agents having beliefs coordinated by an infinite
quantity of data.7
Towards the second question of robustness to the common prior assumption, I
propose a new robustness criterion for strategic predictions based on the quantity of
data that agents need to see. I define a sequence of incomplete information games,
called inference games, which are indexed by a quantity of (public) observations
n < ∞. In each of these games, agents observe n random observations, and form beliefs satisfying common inference. As the quantity of data n tends to infinity, this sequence of games (almost surely) converges to the game in which agents have common certainty of the true parameter value. But for any n < ∞, agents have
different beliefs.
The main part of the paper (Sections 5 and 6) asks: Which solutions of the
limit complete information game persist (with high probability) in these finite-data
inference games? The key object of study is p_n(a), the probability that an action profile a is a solution given n observations. Section 5 characterizes which solutions have the property that p_n(a) → 1 as n tends to infinity; these solutions are said
to be robust to inference. I find that Nash equilibria are robust to inference if and
5The property of uniform consistency is satisfied by many families of inference rules, including any finite inference rule class, as well as certain classes of kernel density estimators with variable bandwidths, and certain classes of Bayes estimators with heterogeneous priors.
6This assumes implicitly that the unknown parameter can be identified in the data. In the proposed framework, disagreement may persist even given infinite quantities of data if the parameter is not identified.
7Recent papers have argued that agreement need not occur even in an infinite data limit. For example, Acemoglu et al. (2015) show that asymptotic beliefs need not agree when individuals are uncertain about signal distributions. I assume agreement given infinite data in the main part of the paper to emphasize the question of when (sufficient) agreement occurs given finite data. In Section 7.1, I show that this is a stronger assumption than is necessary for the main results.
only if they are strict (Theorem 1), and that the robustness of rationalizable actions
can be characterized using procedures of iterated elimination of strategies that are
never a strict best reply (Theorem 2).
In practice, agents only observe restricted amounts of data. Thus, strategic
behavior in the limit (as the quantity of data grows arbitrarily large) may not be the
most appropriate criterion for predictions in real economic environments. I suggest
next that we can provide a measure for how robust a solution is by looking at how
much data is required to support the solution. In Section 6, I provide lower bounds
on pn(a) for Nash equilibria (Proposition 2) and rationalizable actions (Proposition
3). For both solution concepts, the quantity of data required depends on several
features:
First, it depends on a cardinal measure of strictness of the solution. Say that an action profile is a δ-strict NE if each agent’s prescribed action is at least δ better than his next best action; and say that an action profile is δ-strict rationalizable if it can be rationalized by a chain of best responses, in which each action yields at least δ over the next best alternative. This parameter δ turns out to determine how much estimation error the solution can withstand—the higher the degree of strictness (the larger the parameter δ), the less data agents need to see.
Second, the quantity of data required depends on the “diversity" of the inference
rules. When agents have common knowledge of a smaller set of inference rules,
or when these inference rules diverge less in their predictions given common
data, then fewer observations are needed to coordinate beliefs. Conversely, lack
of common knowledge over what constitutes a reasonable interpretation of data
serves to prolong disagreement. Thus, this criterion provides a formal sense in
which the common prior assumption is less appropriate for predicting strategic
interactions across cultures, nations, and industries.
Finally, the quantity of data required depends on the “complexity" of the
learning problem. I do not provide a universal notion of complexity; instead, the
relevant determinants are seen to vary with the choice of inference rules. For
many classes of inference rules, an important determinant is dimensionality, and I
provide several concrete examples to illustrate this. In these cases, predictions in
the limit complete information game are less robust when payoffs are a function of
a greater number of covariates.
These comparative statics are, in my view, a key advantage to modeling beliefs
using the proposed framework. When agents learn from data, possibly using
different inference rules, then channels for disagreement emerge that are complementary to (and distinct from) the traditional channel of differential information. In particular, the amount of common knowledge over how to interpret data, and the “dimensionality” or “complexity” of the unknown parameter, are both crucial to determining dispersion in beliefs. These sources for disagreement have potentially new implications for policy design and informational design: for example, summary statistics may facilitate coordination by reducing the complexity of
information (and thus, coordinating beliefs).
The final sections proceed as follows: Section 7 examines several modeling
choices made in the main text, and discusses the extent to which the main results
rely on these choices. In particular, I look at relaxations of uniform consistency
(Section 7.1), the introduction of private data (Section 7.2), and the introduction of
limit uncertainty (Section 7.3).
Section 8 surveys the related literature. This paper builds a bridge between
the body of work that studies the robustness of equilibrium and equilibrium
refinements (Fudenberg et al., 1988; Carlsson and van Damme, 1993; Kajii and
Morris, 1997; Weinstein and Yildiz, 2007), and the body of work that studies the
asymptotic properties of learning from data (Cripps et al., 2008; Al-Najjar, 2009;
Acemoglu et al., 2015).
Section 9 concludes.
1.2 Example
I begin by illustrating ideas with a simple coordination game, in which two farmers
decide simultaneously whether or not to adopt a new agricultural technology—for
example, a new pest-resistant grain. Continued production of the existing grain
results in a payoff of 1/2. Transitioning alone to the new grain results in a payoff of 0, since neither farmer can individually finance the distribution and transportation costs of this new grain. Finally, coordinated transition results in an unknown payoff of θ, which I will assume for simplicity takes the value θ = 1 if crop yield is high, and θ = −1 if crop yield is low. The payoffs to this game are summarized below:

                Adopt        Not Adopt
    Adopt       θ, θ         0, 1/2
    Not Adopt   1/2, 0       1/2, 1/2
When should we predict coordination on adoption of the new grain?
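As a quick worked step (my addition, not in the original text): writing p for the probability a farmer assigns to high yield, adopting against an adopting opponent pays E[θ], while not adopting guarantees 1/2, so

\[ \mathbb{E}[\theta] = p \cdot 1 + (1 - p) \cdot (-1) = 2p - 1 \;\geq\; \tfrac{1}{2} \quad \Longleftrightarrow \quad p \;\geq\; \tfrac{3}{4}. \]

Coordinated adoption can thus be sustained only under sufficiently optimistic beliefs about θ (here p ≥ 3/4), whereas Not Adopt is always the unique best reply to Not Adopt. The question is when the data support such optimism at all orders of belief.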
In the standard approach, all uncertainty about the new grain is described by a state space Ω, and we assume that agents share a common prior over Ω. In the
absence of any private information about the new grain, this approach implies that
the two farmers have an identical belief over its future yield. But predicting yields
of a new kind of crop is not easy: crop yield is a function of many environmental
conditions—the soil structure, light exposure, quantity of rain, etc. I propose
an alternative perspective for modeling their beliefs to capture the role that this
complexity may play in determining disagreement between agents.
Learning from Data. In the proposed model, farmers predict the future yield
of the new crop based on how it previously fared in other environments. There
are r < ∞ relevant environmental conditions (soil structure, light, . . . ). For this example, let us assume that each condition takes a value in the interval [−c, c], and that the true relationship between environmental conditions in [−c, c]^r and crop yields (high or low) is given by the following deterministic function:

\[ p(x) = \begin{cases} \text{High} & \text{if } x \in [-c', c']^r \\ \text{Low} & \text{otherwise} \end{cases} \qquad \forall\, x \in [-c, c]^r, \]

where c′ ∈ (0, c). That is, crop yields are high under conditions in [−c′, c′]^r, and low otherwise. (See the figure below for an illustration of this relationship with r = 2.)
Figure 1.1: Two relevant attributes (r = 2). Yield is high under environmental conditions in [−c′, c′]², and low otherwise. Farmers do not know the high yield region (shaded).
It is common knowledge that there is a (hyper-)rectangular region of favorable
environmental conditions (high yield), and a remaining region of unfavorable
conditions. The farmers do not, however, know the exact regions. Instead, they
observe the common data

(x_1, p(x_1)), . . . , (x_n, p(x_n)),

where the x_i are identically and independently drawn from a uniform distribution on [−c, c]^r. That is, farmers observe crop yields in n different sampled environments.
From this data, farmers infer a partitioning p̂ that correctly classifies every observation, and use this inferred relationship to predict whether the yield will be low or high in their region. For simplicity, let us take this region to be described by the origin.
The key observation is that many rectangular partitionings will perfectly fit the data, some of which may have different predictions at the origin. This creates room for potential (rational) disagreement. (Figure 1.2 illustrates two such partitionings
based on an example dataset.) Say that a strategic prediction is robust if it holds
without further assumption regarding which partitioning either farmer infers, or
which partitioning he believes the other farmer to infer. When is coordinated
adoption robust?
Figure 1.2: Circles indicate low yields, and squares indicate high yields. The two rectangles identify partitionings (predict high yield if x is contained within the rectangle, and low yield otherwise) that exactly fit the data.
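The farmers' inference problem is easy to simulate. The following sketch is illustrative only (the dissertation's formal treatment is in the appendix). It uses the fact that, because the true high-yield region contains the origin, every axis-aligned rectangle that fits the data predicts high yield at the origin exactly when the componentwise bounding box of the high-yield observations already straddles the origin.

    import numpy as np

    rng = np.random.default_rng(1)

    def adoption_is_robust(n, r, c=1.0, c_prime=0.5):
        # Draw n environments uniformly on [-c, c]^r and label them by the
        # true high-yield rectangle [-c_prime, c_prime]^r. Robustness holds
        # iff the bounding box of the high-yield points contains the origin,
        # i.e. each coordinate has high-yield observations on both sides of 0.
        x = rng.uniform(-c, c, size=(n, r))
        high = x[np.all(np.abs(x) <= c_prime, axis=1)]
        if high.shape[0] == 0:
            return False
        return bool(np.all(high.min(axis=0) <= 0) and np.all(high.max(axis=0) >= 0))

    for n, r in [(10, 1), (50, 1), (10, 4), (50, 4)]:
        freq = np.mean([adoption_is_robust(n, r) for _ in range(4000)])
        print(f"n={n:3d}, r={r}: simulated plausibility ~ {freq:.3f}")

More data (larger n) pushes the simulated frequency toward 1, while more environmental conditions (larger r) pushes it down, previewing the observations that follow.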
Robustness. Let us first clarify this criterion as follows. Every realization of
the data pins down a set of predictions, each of which is consistent with some
rectangular partitioning that exactly fits the data. Now, suppose only that this
set of predictions is common certainty—that is, both farmers put probability 1
on this set of predictions, believe that the other puts probability 1 on this set of
predictions, and so forth.8 This defines a set of possible (hierarchies of) beliefs that
either farmer could hold.
The key object of interest will be the probability that data is realized such that
Adopt is rationalizable given any belief in this set. This probability is a function of
the quantity of data n and of the number of conditions r; I will write it as p(n, r),
and refer to it as the plausibility of coordination on adoption.
Claim 1 For every quantity of data n ≥ 1, number of environmental conditions r ≥ 1, and constants c, c′ ∈ ℝ₊,

\[ p(n, r) = \left( 1 - \left[\, 2\left(\frac{2c - c'}{2c}\right)^{n} - \left(\frac{c - c'}{c}\right)^{n} \right] \right)^{r}. \]

Proof 1 See appendix.
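A quick numerical illustration of the claim (my addition; c = 1 and c′ = 0.5 are arbitrary choices). For r = 1 the relevant event is simply that some high-yield observation falls on each side of the origin, which gives an easy Monte Carlo check of the closed form:

    import numpy as np

    rng = np.random.default_rng(2)

    def p_formula(n, r, c=1.0, cp=0.5):
        # Closed form from Claim 1.
        q = 2 * ((2 * c - cp) / (2 * c)) ** n - ((c - cp) / c) ** n
        return (1 - q) ** r

    # Increasing in n (Observation 1), decreasing in r (Observation 2):
    for n in (5, 20, 80):
        print(n, [round(p_formula(n, r), 3) for r in (1, 2, 8)])

    def mc(n, c=1.0, cp=0.5, reps=200_000):
        # Monte Carlo estimate of the r = 1 event probability.
        x = rng.uniform(-c, c, size=(reps, n))
        hi = np.abs(x) <= cp
        left = np.any(hi & (x <= 0), axis=1)
        right = np.any(hi & (x >= 0), axis=1)
        return float(np.mean(left & right))

    print(p_formula(10, 1), mc(10))   # the two numbers agree closely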
This claim has several implications.
Observation 1 Coordinated adoption is more plausible when the quantity of data n is
larger.
8 Formally, define 𝒫 to be the family of functions

\[ \hat{p}(x) = \hat{p}(x^1, \dots, x^r) = \begin{cases} 1 & \text{if } x^k \in [\underline{c}^k, \overline{c}^k] \text{ for every } k = 1, \dots, r \\ 0 & \text{otherwise} \end{cases} \qquad \forall\, x \in \mathcal{X}, \]

parametrized by the tuple (\underline{c}^1, \overline{c}^1, \dots, \underline{c}^r, \overline{c}^r) ∈ ℝ^{2r}. This defines the class of all axis-aligned hyper-rectangles. Agents have common certainty in the set {p̂(0) : p̂ ∈ 𝒫 and p̂(x_k) = p(x_k) ∀ k = 1, . . . , n}.
From Claim 1, we see that p(n, r) is increasing in n. Indeed, p(n, r) → 1 as the quantity of data n tends to infinity (for fixed r). Thus, if farmers observe crop yields in sufficiently many different environments, then coordinated adoption is arbitrarily plausible.
Observation 2 Coordinated adoption is less plausible when the number of environmental
conditions r is larger.
From Claim 1, we see that p(n, r) is decreasing in r when n is sufficiently large.9 In fact, p(n, r) → 0 as the number of environmental conditions r tends to infinity (for fixed n). This suggests that coordinated adoption is more plausible when crop yield depends only on a single environmental condition than when it depends on a high-dimensional set of covariates.
Observation 3 Coordinated adoption is less plausible when the set of inference rules is
larger.
The probability p(n, r) weakly decreases for every n and r as we expand the
set of possible interpretations of the data. For example, suppose we assume that
it is common knowledge that the region of high crop yields is described by a rotated rectangle, instead of an axis-aligned rectangle as assumed above. This weakly
expands the room for possible disagreement, since there are datasets such as the
one in the figure below where every axis-aligned rectangle partitioning predicts
high yield at the origin, but some rotated rectangle partitioning predicts low yield.10
9A sufficient condition is

\[ 2\left(\frac{2c - c'}{2c}\right)^{n} - \left(\frac{c - c'}{c}\right)^{n} < \frac{1}{r}. \]

10Since the set of rotated rectangle partitionings includes the set of axis-aligned rectangle partitionings, clearly if every partitioning in the former set predicts high yield at the origin, then every partitioning in the latter set will as well.
Figure 1.3: Every axis-aligned rectangle partitioning predicts high yield at the origin, but there exists a rotated rectangle partitioning that predicts low yield.
This suggests that coordinated adoption is more plausible when extrapolation from
past crop yields is coordinated by external means—for example, a common culture,
or a common set of heuristics.
Takeaways. Under the proposed approach, prediction of coordinated adoption
of the new grain is more plausible when agents have previously observed few
trial instances of the new crop, when the determinants of crop yield are high-
dimensional, and when there is not a common approach to extrapolating from
past yields. In the main body of the paper, I generalize the ideas in this example,
proposing a model in which agents have common certainty in the predictions of
an arbitrary class of inference rules, and a robustness criterion for equilibria and
rationalizable actions in all finite normal-form games.
1.3 Preliminaries and Notation
1.3.1 The game
Consider a set ℐ of I < ∞ agents and finite action sets (A_i)_{i∈ℐ}. As usual, let A = ∏_{i∈ℐ} A_i. The set of complete information (normal-form) games defined on these primitives is the set of payoff matrices in U := ℝ^{|ℐ|×|A|}. Let Θ ⊆ ℝ^k be a compact set of parameters and fix a metric d⁰ on Θ such that (Θ, d⁰) is complete and separable. I will identify these parameters with payoff matrices under a map g satisfying:
Assumption 1 g : Θ → U is a bounded and Lipschitz continuous embedding.11
This map g can be interpreted as capturing the known information about the structure of payoffs. For example, in the game presented in Section 2, players know that payoffs belong to the parametric family of payoffs

          a1            a2
    a1    θ, θ          0, 1/2
    a2    1/2, 0        1/2, 1/2
but do not know the value of θ. Notice that identifying payoffs with parameters in this way is without loss of generality, since we can always take Θ := U and set g to be the identity map. For clarity of exposition, I will sometimes write u(a, θ) for g(θ)(a), or u_i(a, θ) for the payoffs to agent i. Finally, denote the true value of the parameter by θ*, and suppose that it is unknown.

Remark 1 It is also possible to interpret θ as indexing a family of distributions over payoffs; for example, θ may be the mean of a normal distribution with a fixed variance. In this case, g maps parameters in Θ to expected payoffs under the distribution determined by θ.
1.3.2 Beliefs
Now let us define beliefs on Θ.

11A map is an embedding if it is a homeomorphism onto its image.
Type space. For notational simplicity, consider first I = 2. Following Brandenburger and Dekel (1993), recursively define

\[ X_0 = \Theta, \qquad X_1 = X_0 \times \Delta(X_0), \qquad \dots, \qquad X_n = X_{n-1} \times \Delta(X_{n-1}), \]

and take T_0 = ∏_{n=0}^{∞} Δ(X_n). An element (t¹, t², . . . ) ∈ T_0 is a complete description of beliefs over Θ (describing the agent’s uncertainty over Θ, his uncertainty over his opponents’ uncertainty over Θ, and so forth), and is referred to as a type.
This approach can be generalized to I agents, taking X_0 = Θ, X_1 = X_0 × (Δ(X_0))^{I−1}, and building up in this way. Mertens and Zamir (1985) have shown that for every agent i, there is a subset of types T*_i (that satisfy the property of coherency12) and a function κ*_i : T*_i → Δ(Θ × T*_{−i}) such that κ_i(t_i) preserves the beliefs in t_i; that is, marg_{X_{n−1}} κ_i(t_i) = tⁿ_i for every n. Notice that T*_{−i} is used here to denote the set of profiles of opponent types.
The tuple (T*_i, κ*_i)_{i∈ℐ} is known as the universal type space. Other tuples (T_i, κ_i)_{i∈ℐ}, with T_i ⊆ T*_i for every i and κ_i : T_i → Δ(Θ × T_{−i}), represent alternative (smaller) type spaces. Finally, let T* = T*_1 × · · · × T*_I denote the set of all type profiles, with typical element t = (t_1, . . . , t_I).
Remark 2 Types are sometimes modeled as encompassing all uncertainty in the game. In this paper, I separate strategic uncertainty over opponent actions from structural uncertainty over Θ.
12marg_{X_{n−2}} tⁿ = t^{n−1}, so that (t¹, t², . . . ) is a consistent stochastic process.

Topology on types. Let Tᵏ_i = Δ(X_{k−1}) = Δ(Θ × T^{k−1}_{−i}) denote the set of possible k-th order beliefs for agent i.13 The uniform-weak topology on T*_i, proposed in Chen et al. (2010), is the metric topology generated by the distance

\[ d_i^{UW}(t_i, t_i') = \sup_{k \geq 1} d^k(t_i, t_i') \qquad \forall\, t_i, t_i' \in T_i^*, \]

where d⁰ is the metric defined on Θ (see Section 2.1)14 and, recursively for k ≥ 1, dᵏ is the Prokhorov distance15 on Δ(Θ × T^{k−1}_{−i}) induced by the metric max{d⁰, d^{k−1}} on Θ × T^{k−1}_{−i}.
Common p-belief. Define Ω = Θ × T* to be the set of all “states of the world,” such that every element in Ω corresponds to a complete description of uncertainty. Following Monderer and Samet (1989), for every E ⊆ Ω, let

\[ B^p(E) := \{(\theta, t) : \kappa_i(t_i)(E) \geq p \text{ for every } i\} \tag{1.1} \]

describe the event in which every agent believes E ⊆ Ω with probability at least p. Common p-belief in the set E is given by

\[ C^p(E) := \bigcap_{k \geq 1} [B^p]^k(E). \]

The special case of common 1-belief is referred to in this paper as common certainty. I use in particular the concept of common certainty in a set of first-order beliefs.
13Working only with types in the universal type space, it is possible to identify each X_k with its first and last coordinates, since all intermediate information is redundant.
14In Chen et al. (2010), Θ is finite and d⁰ is the discrete metric, but this construction extends to all complete and separable (Θ, d⁰).
15Recall that the Lévy-Prokhorov distance ρ between measures on a metric space (X, d) is defined by

\[ \rho(\mu, \mu') = \inf\left\{ \delta > 0 : \mu(E) \leq \mu'(E^\delta) + \delta \text{ for each measurable } E \subseteq X \right\} \]

for all μ, μ′ ∈ Δ(X), where E^δ = {x ∈ X : inf_{x′∈E} d(x, x′) < δ}.
For any F ⊆ Δ(Θ), define the event

\[ E_F := \{(\theta, t) : \mathrm{marg}_\Theta\, t_i \in F \text{ for every } i\}, \tag{1.2} \]

in which every agent’s first-order belief is in F. Then, C¹(E_F) is the event in which it is common certainty that every agent has a first-order belief in F. The set of types t_i given which agent i believes that F is common certainty is the projection of C¹(E_F) onto T*_i.16 Since this set is identical across agents, I will refer to this simply as the set of types with common certainty in F.
1.3.3 Solution concepts
Two solution concepts for incomplete information games are used in this paper.
Interim Correlated Rationalizability (Dekel et al., 2007). For every agent i and type t_i, set S⁰_i[t_i] = A_i, and define Sᵏ_i[t_i] for k ≥ 1 such that a_i ∈ Sᵏ_i[t_i] if and only if a_i ∈ BR_i(marg_{Θ×A_{−i}} π) for some π ∈ Δ(Θ × T_{−i} × A_{−i}) satisfying (1) marg_{Θ×T_{−i}} π = κ_i(t_i) and (2) π(a_{−i} ∈ S^{k−1}_{−i}[t_{−i}]) = 1, where S^{k−1}_{−i}[t_{−i}] = ∏_{j≠i} S^{k−1}_j[t_{−j}]. We can interpret π as an extension of the belief κ_i(t_i) onto the space Δ(Θ × T_{−i} × A_{−i}), with support in the set of actions that survive k − 1 rounds of iterated elimination of strictly dominated strategies for types in T_{−i}. For every i, define

\[ S^\infty_i[t_i] = \bigcap_{k=0}^{\infty} S^k_i[t_i] \]

to be the set of actions that are interim correlated rationalizable for agent i of type t_i, or (henceforth) simply rationalizable.
Interim Bayesian Nash equilibrium. Fix any type space (T_i, κ_i)_{i∈ℐ}. A strategy for player i is a measurable function s_i : T_i → A_i. The strategy profile (s_1, . . . , s_I) is a Bayesian Nash equilibrium if

\[ s_i(t_i) \in \operatorname*{argmax}_{a_i \in A_i} \int_{\Theta \times T_{-i}} u_i(a_i, s_{-i}(t_{-i}), \theta)\, d\kappa_i(t_i) \qquad \text{for every } i \in \mathcal{I} \text{ and } t_i \in T_i. \]

In a slight abuse of terminology, I will say throughout that action profile a is an (interim) Bayesian Nash equilibrium if the strategy s with s_i(t_i) = a_i for every t_i ∈ T_i is a Bayesian Nash equilibrium.

16Notice that when beliefs are allowed to be wrong (as they are in this approach), individual perception of common certainty is the relevant object of study. That is, agent i can believe that a set of first-order beliefs is common certainty, even if no other agent in fact has a first-order belief in this set. Conversely, even if every agent indeed has a first-order belief in F, agent i may believe that no other agent has a first-order belief in this set.
1.4 Learning from Data
What are agent beliefs over the unknown parameter (and over opponent beliefs
over the unknown parameter), and how are they formed? In this section, I describe
a framework in which beliefs are formed by learning from data.
Say that a dataset is any sequence of n observations z_1, . . . , z_n, sampled i.i.d. from a set 𝒵 according to an exogenous distribution P. Throughout, I use Zⁿ to denote the random sequence corresponding to n observations, and zⁿ to denote a typical realization (or simply Z and z when the number of observations is not important).
The key assumption of my approach is a restriction on the possible types that agents can have following rationalization of the data. I begin by introducing a few relevant concepts. Define an inference rule to be any map μ : z ↦ μ_z from the set of possible datasets17 to Δ(Θ), the set of (Borel) probability measures on Θ. Fix a family of inference rules M.

Definition 1 For every dataset z, say that

F_z = { μ_z : μ ∈ M } ⊆ Δ(Θ)

is the set of plausible first-order beliefs.

17∪_{n≥1} 𝒵ⁿ, where 𝒵ⁿ denotes the n-fold Cartesian product of the set 𝒵.
This is the set of all distributions over Θ that emerge from evaluating the dataset z using an inference rule in M. For every dataset z, define T_z to be the set of all (interim) types for whom F_z is common certainty.18 That is, every type in T_z has a plausible first-order belief, puts probability 1 on every other agent having a plausible first-order belief, and so forth. The main restriction below, which I will refer to from now on as common inference, assumes that following realization of data z, every agent has a type in T_z.19

Assumption 2 (Common inference) Given any dataset z, every agent i has an (interim) type t_i ∈ T_z.20
Several special examples for the set of inference rules M are collected below.

Example 1 (Bayesian updating with a common prior) Define μ to be the map that takes any dataset z into the Bayesian posterior induced from the common prior and a common likelihood function. Let M = {μ}. Then, for every dataset z, the set F_z consists of the singleton Bayesian posterior induced from the common prior, and the set T_z consists of the singleton type with common certainty in this Bayesian posterior.21
Example 2 (Bayesian updating with uncommon priors) Fix a set of prior distributions on Θ and a common likelihood function. Every inference rule μ ∈ M is identified with a prior distribution in this set, and maps the observed data to the Bayesian posterior induced from this prior and the common likelihood.

18See the end of Section 3.2 for a more formal exposition.
19The results in this paper follow without modification if we relax this assumption to common certainty in the convex hull of distributions in F_z. See Lemma 5.
20Notice that this paper takes an unusual interpretation of the ex-ante/interim distinction, which does not explicitly invoke a Bayesian perspective. In this paper, the role of the prior is replaced by a data-generating process.
21That is, his first-order beliefs are given by this posterior, and he believes with probability 1 that his opponents’ first-order beliefs are given by this posterior, and so forth.
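For a concrete instance of Example 2 (my illustration; the Beta-Bernoulli setup and all numbers are assumptions, not the dissertation's): let θ be the bias of a coin, take Bernoulli(θ) as the common likelihood, and identify each inference rule with a different Beta prior. With little data the plausible first-order beliefs in F_z disagree; as n grows all posteriors concentrate on the truth, previewing the uniform-consistency property introduced below.

    import numpy as np

    rng = np.random.default_rng(3)
    theta_star = 0.7                      # true parameter
    priors = [(1, 1), (8, 2), (2, 8)]     # three Beta priors = three inference rules
    data = rng.random(1000) < theta_star  # i.i.d. Bernoulli(theta_star) draws

    for n in (5, 50, 1000):
        k = data[:n].sum()
        # Posterior is Beta(a + k, b + n - k); report its mean for each prior.
        means = [(a + k) / (a + b + n) for a, b in priors]
        print(n, [round(float(m), 3) for m in means])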
Example 3 (Kernel regression with different bandwidth sequences) Let X ⊆ ℝᵈ be a set of attributes, which are related to outcomes in Θ under the unknown map f : X → Θ. Data is of the form

(x_1, y_1), . . . , (x_n, y_n),

where every x_k ∈ X and every y_k = f(x_k). Suppose that the unknown parameter θ* is the value of the function f evaluated at input x_0.

Inference rules in M map the data to an estimate for θ* by first producing an estimated function f̂, and then evaluating this function at x_0. The approach for estimating f is as follows: Fix a kernel function22 K : ℝᵈ → ℝ, and let h_n → 0 be a sequence of constants tending to zero. Define f̂_{n,h} : X → Θ to be the Nadaraya-Watson estimator

\[ \hat{f}_{n,h}(x) = \frac{(nh_n)^{-1} \sum_{k=1}^{n} y_k\, K\big((x - x_k)/h_n^{1/d}\big)}{(nh_n)^{-1} \sum_{k=1}^{n} K\big((x - x_k)/h_n^{1/d}\big)}, \]

which produces estimates by taking a weighted average of nearby observations. This describes an individual inference rule μ.

Now let H be a set of (bandwidth) sequences, each of which determines a different level of “smoothing” applied to the data. Every inference rule μ ∈ M is identified with a sequence h_n ∈ H. Thus, M is a set of kernel regression estimators with different bandwidth sequences.
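A one-dimensional sketch of this example (my illustration; the kernel, bandwidth constants, and target function are arbitrary): two inference rules apply the Nadaraya-Watson estimator to the same data with different bandwidth sequences h_n = c·n^{-1/5}, disagreeing noticeably about f(x_0) at small n and agreeing as n grows.

    import numpy as np

    rng = np.random.default_rng(4)

    def nadaraya_watson(x0, xs, ys, h):
        # Gaussian-kernel Nadaraya-Watson estimate of f(x0) with bandwidth h.
        w = np.exp(-0.5 * ((x0 - xs) / h) ** 2)
        return np.sum(w * ys) / np.sum(w)

    f = lambda x: np.sin(3 * x)    # unknown map from attributes to outcomes
    x0 = 0.5                       # parameter of interest: theta_star = f(x0)

    for n in (10, 100, 10_000):
        xs = rng.uniform(-1, 1, n)
        ys = f(xs)
        ests = [nadaraya_watson(x0, xs, ys, c * n ** -0.2) for c in (0.3, 1.0)]
        print(n, [round(float(e), 3) for e in ests], "truth:", round(f(x0), 3))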
22K is measurable and satisfies the conditions ∫_{ℝᵈ} K(x) dx = 1 and sup_{x∈ℝᵈ} ‖K(x)‖ = κ̄ < ∞.

Remark 3 Common inference does not impose an explicit relationship between agent beliefs
and estimators. For example, all of the following are consistent with common inference:
• Every agent i is identified with an inference rule μ_i ∈ M, and the sequence of inference rules (μ_i)_{i∈ℐ} is common knowledge.
• Every agent i is identified with an inference rule μ_i ∈ M. Agent i knows his own inference rule μ_i, but has a nondegenerate belief distribution over the inference rules of other agents.
• Every agent i is identified with a distribution P_i on M, and draws an inference rule at random from M under this distribution.
In the main part of this paper, I assume common inference (Assumption 2),
and ask what properties of beliefs and strategic behavior can be deduced from this
assumption alone.
1.4.1 When do agents commonly learn?
Let us begin by considering the property of common learning. Say that agents commonly learn the true parameter θ* if, as the quantity of data increases, every agent believes that it is approximate common certainty that the value of the parameter is close to θ*. It will be useful in this section to assign to every agent i a map t_i : z ↦ t_i^z, which takes the realized data into a type in T_z.

Definition 2 (Common Learning) Agents commonly learn θ* if

\[ \lim_{n \to \infty} P^n\Big(\big\{ z^n : t_i^{z^n} \in C^p\big(B_\varepsilon(\theta^*)\big) \big\}\Big) = 1 \qquad \forall\, i, \]

for every p ∈ [0, 1) and ε > 0.

That is, for every level of confidence p and precision ε, every agent eventually believes that the ε-ball around the true parameter θ* is common p-belief. This
definition is modified from Cripps et al. (2008).23 When does common inference imply that agents commonly learn the true parameter θ*?

The following property of families of inference rules M will be useful:

Definition 3 (Uniform consistency.) The family of inference rules M is θ*-uniformly consistent if

\[ \sup_{\mu \in M} d_P(\mu_{Z^n}, \delta_{\theta^*}) \to 0 \quad a.s., \]

where d_P is the Prokhorov metric on Δ(Θ).

Remark 4 Say that an individual inference rule μ is θ*-consistent if d_P(μ_{Zⁿ}, δ_{θ*}) → 0 a.s. Uniform consistency is immediately satisfied by any finite family of θ*-consistent inference rules.
Recalling that d_P metrizes the topology of weak convergence of measures, this property says that for every μ ∈ M, the distribution μ_{Zⁿ} (almost surely) weakly converges to a degenerate distribution on θ*. Moreover, this convergence is uniform in inference rules.

Proposition 1 Every agent commonly learns the true parameter θ* if and only if M is θ*-uniformly consistent.
The structure of the argument is as follows. From Chen et al. (2010), we know that convergence in the uniform-weak topology is equivalent to approximate common certainty in the true parameter. Under θ*-uniform consistency, it can be shown that with probability 1, every sequence of types from {T_{Zⁿ}}_{n≥1} converges in the uniform-weak topology to the type with common certainty in θ*. This is, loosely, because possible k-th order beliefs are restricted to have support in the possible (k − 1)-th order beliefs, so that in fact the rate of convergence of first-order beliefs is a uniform upper bound on the rate of convergence of beliefs at every order. The details of this proof can be found in the appendix.

I assume in the remainder of the paper that M is θ*-uniformly consistent, so that beliefs converge as the quantity of data tends to infinity. The next part of the paper shifts the focus to the stronger property of convergence of solution sets.

23I take ε > 0, so that agents believe it is approximate common certainty that the parameter is close to θ*; in Cripps et al. (2008), Θ is finite, so agents believe it is approximate common certainty that the parameter is exactly θ*.
1.5 Robustness to Inference
Suppose that action a_i is rationalizable for agent i (or, action profile a is an equilibrium) in the complete information game in which the true parameter θ* is common certainty. Can we guarantee that action a_i (action profile a) remains rationalizable (an equilibrium) when payoffs are inferred from data, so long as agents observe a sufficiently large number of observations?
1.5.1 Concepts
I will first introduce the idea of an inference game. For any dataset z, define G(z) to be the incomplete information game with primitives ℐ, (A_i)_{i∈ℐ}, Θ, g, and type space

𝒯_z = (T_i^z, κ_i^z)_{i∈ℐ},

where T_i^z = T_z for every i, and κ_i^z is the restriction of κ_i (as defined in Section 3.2) to T_i^z.24 Notice that if T_z consists only of the type with common certainty of θ*, then this game reduces to the complete information game with payoffs given by g(θ*).

24Notice that κ_i^z(T_i^z) ⊆ Δ(Θ × T^z_{−i}) for every agent i, so this is a belief-closed type space.
We can interpret inference games as follows. Suppose the analyst knows only
that agents have observed data z, and that Assumption 2 (Common Inference)
holds. Then, the set of types that any player i may have is given by T_z. Recall that as the quantity of data tends to infinity, this set T_z converges almost surely to the singleton type with common certainty in θ*. So, for large quantities of data,
inference games approximate the (true) complete information game. The question
of interest is with what probability solutions in this limit game persist in finite-data
inference games. This question is made precise for the solution concepts of Nash
equilibrium and rationalizability in the following way.
For any Nash equilibrium a of the limit complete information game, define p_n^{NE}(a) to be the probability that data zⁿ is realized such that the strategy profile (s_i)_{i∈ℐ}, with

s_i(t_i) = a_i ∀ i ∈ ℐ, t_i ∈ T_i^{zⁿ},

is a Bayesian Nash equilibrium. Analogously, define p_n^R(i, a_i) to be the probability that data zⁿ is realized such that

a_i ∈ S_i^∞[t_i] ∀ t_i ∈ T_i^{zⁿ};

that is, a_i is rationalizable for agent i given any type in T_i^{zⁿ}.
Definition 4 The rationalizability of action a_i for player i is robust to inference if p_n^R(i, a_i) → 1 as n → ∞. The equilibrium property of action profile a is robust to inference if p_n^{NE}(a) → 1 as n → ∞.
What is the significance of robustness to inference? Suppose that action a_i is rationalizable when the true parameter is common certainty, and suppose moreover that this property of a_i is robust to inference. Then, the analyst believes with high probability that a_i is rationalizable for all types in the realized inference game, so long as the quantity of observed data is sufficiently large. Conversely, suppose that a_i is rationalizable when the true parameter is common certainty, but that this property of a_i is not robust to inference. Then, there exists a constant δ > 0 such that for any finite quantity of data, the probability that a_i fails to be rationalizable for some type in the realized inference game is at least δ. In this way, robustness to inference is a minimal requirement for the rationalizability of a_i to persist when agents infer their payoffs from data. Analogous statements apply when we replace rationalizability with equilibrium.
Let us first consider two trivial examples in which robustness to inference
imposes no restrictions. Consider the game with payoff matrix
          a1            a2
    a1    θ*, θ*        0, 0
    a2    0, 0          1/2, 1/2

where θ* > 0. Is the equilibrium (a1, a1) robust to inference?
Example 4 (Trivial inference.) Let M consist of the singleton inference rule μ satisfying

μ_z = δ_{θ*} ∀ z,

so that μ_z is always degenerate on the true value θ*. Then, the set of plausible first-order beliefs is F_z = {δ_{θ*}} for every z, so that the true parameter θ* is common certainty with probability 1. Thus, the inference game G(z) reduces to a complete information game, and the equilibrium property of (a1, a1) is trivially robust to inference.
Example 5 (Unnecessary inference.) Let Θ := [0, ∞). Then, action profile (a1, a1) is a Nash equilibrium for every possible value of θ ∈ Θ. Thus, the strategy profile that maps any type of either player to the action a1 is a Bayesian Nash equilibrium for any beliefs that players might hold over Θ. In this way, the family of inference rules M is irrelevant, and (a1, a1) is again trivially robust to inference.
The two following conditions rule out these cases in which inference is either trivial
or unnecessary.
Assumption 3 (Nontrivial Inference.) There exists a constant γ > 0 such that

Pⁿ({ zⁿ : δ_{θ*} ∈ Int(F_{zⁿ}) }) > γ

for every n sufficiently large.
This property says that for sufficient quantities of data, the probability that δ_{θ*} is contained in the interior of the set of plausible first-order beliefs F_{zⁿ} is bounded away from 0. Assumption 3 rules out the example of trivial inference, as well as related examples in which every inference rule in M overestimates, or every inference rule underestimates, the unknown parameter.25
To rule out the second example, I impose a richness condition on the image of g. For every agent i and action a_i ∈ A_i, define S(i, a_i) to be the set of complete information games in which a_i is a strictly dominant strategy for agent i; that is,

S(i, a_i) := { u′ ∈ U : u′_i(a_i, a_{−i}) > u′_i(a′_i, a_{−i}) ∀ a′_i ≠ a_i and ∀ a_{−i} }.

Assumption 4 (Richness.) For every i ∈ ℐ and a_i ∈ A_i, g(Θ) ∩ S(i, a_i) ≠ ∅.
Under this restriction, which is also assumed in Carlsson and van Damme
(1993) and Weinstein and Yildiz (2007), every action is strictly dominant at some
parameter value. This condition is trivially satisfied if Θ = U.

25This does not rule out sets of biased estimators. It may be that in expectation, every inference rule in M overestimates the true parameter; Assumption 3 requires that underestimation occurs with probability bounded away from 0.
In the subsequent analysis, I assume that the family of inference rules M
satisfies nontrivial inference, and the map g satisfies richness. These conditions are
abbreviated to NI and R, respectively.
1.5.2 Bayesian Nash Equilibrium
When is the equilibrium property of an action profile robust to inference? (From
now on, I will abbreviate this to saying that the action profile is itself robust to
inference.)
Theorem 1 Assume NI and R. Then, the equilibrium property of action profile a* is robust to inference if and only if it is a strict Nash equilibrium.

The intuition for the proof is as follows. Define U^{NE}_{a*} to be the set of all payoffs u such that a* is a Nash equilibrium in the complete information game with payoffs u. The interior of U^{NE}_{a*} is exactly the set of payoffs u with the property that a* is a strict Nash equilibrium given these payoffs. I show that as the quantity of data tends to infinity, agents (almost surely) have common certainty in a shrinking neighborhood of the true payoffs, so it follows that a* is robust to inference if and only if the true payoff function u* = g(θ*) lies in the interior of U^{NE}_{a*}.
Proof 2 First, I show that the interior of the set $U^{NE}_{a^*}$ is characterized by the set of complete information games in which $a^*$ is a strict Nash equilibrium.

Lemma 1 $u \in \mathrm{Int}\left(U^{NE}_{a^*}\right)$ if and only if action profile $a^*$ is a strict Nash equilibrium in the complete information game with payoffs $u$.
Proof 3 Suppose $a^*$ is not a strict Nash equilibrium in the complete information game with payoffs $u$. Then, there is some agent $i$ and action $a_i \neq a^*_i$ such that
$$u_i(a_i, a^*_{-i}) \geq u_i(a^*_i, a^*_{-i}).$$
Define $u^\varepsilon$ such that $u^\varepsilon_i(a_i, a^*_{-i}) = u_i(a_i, a^*_{-i}) + \varepsilon$, and otherwise $u^\varepsilon_i$ agrees with $u_i$. Then, $u^\varepsilon \in B_\varepsilon(u)$ for every $\varepsilon > 0$, but $a_i$ is a strictly profitable deviation for agent $i$ in response to $a^*_{-i}$ in the game with payoffs $u^\varepsilon_i$. So $a^*$ is not an equilibrium in this game. Fix any sequence of positive constants $\varepsilon_n \to 0$. Then, $u^{\varepsilon_n} \to u$ as $n \to \infty$, but $u^{\varepsilon_n} \notin U^{NE}_{a^*}$ for every $n$, so it follows that $u \notin \mathrm{Int}(U^{NE}_{a^*})$, as desired.
Now suppose that $a^*$ is a strict Nash equilibrium in the complete information game with payoffs $u$. Then,
$$\varepsilon^* := \inf_{i \in I}\left( u_i(a^*_i, a^*_{-i}) - \max_{a_i \neq a^*_i} u_i(a_i, a^*_{-i}) \right) > 0,$$
so $u \in B_{\varepsilon^*}(u) \subseteq U^{NE}_{a^*}$, with $B_{\varepsilon^*}(u)$ nonempty and open. It follows that $u \in \mathrm{Int}\left(U^{NE}_{a^*}\right)$, as desired.
Next, I show that $a^*$ is robust to inference if and only if the true payoff function is in the interior of the set $U^{NE}_{a^*}$.

Lemma 2 Let $u^* = g(\theta^*)$. The equilibrium property of action profile $a^*$ is robust to inference if and only if $u^* \in \mathrm{Int}\left(U^{NE}_{a^*}\right)$.
Proof 4 Define $h(\mu) = \int_\Theta g(\theta)\, d\mu$ to be the map from (first-order) beliefs $\mu \in \Delta(\Theta)$ into the expected payoff function under $\mu$.

Figure 1.4: The map $h$ takes first-order beliefs $\mu \in \Delta(\Theta)$ into expected payoff functions $u \in U$ (complete information games); the set of plausible beliefs $F_z$ maps to the set of expected payoff functions $h(F_z)$.
Recall that every dataset $z$ induces a set of plausible first-order beliefs $F_z$. The following claim says that the equilibrium property of $a^*$ is robust to inference if and only if, with probability approaching 1 as $n \to \infty$, the set of expected payoffs $h(F_{Z^n})$ is contained within $U^{NE}_{a^*}$.
Claim 2 The equilibrium property of $a^*$ is robust to inference if and only if
$$P^n\left(\left\{z^n : h(F_{z^n}) \subseteq U^{NE}_{a^*}\right\}\right) \to 1 \quad \text{as } n \to \infty.$$
Proof 5 I will show that the strategy profile $(s_i)_{i \in I}$ with
$$s_i(t_i) = a^*_i \quad \forall\, i \in I, \;\; \forall\, t_i \in T_z$$
is a Bayesian Nash equilibrium if and only if $h(F_z) \subseteq U^{NE}_{a^*}$. From this, the above claim follows immediately.

Suppose $h(F_z) \subseteq U^{NE}_{a^*}$. Then, for any payoff function $u \in h(F_z)$,
$$u_i(a^*_i, a^*_{-i}) \geq u_i(a_i, a^*_{-i}) \quad \forall\, i \in I \text{ and } a_i \neq a^*_i. \qquad (1.3)$$
Fix an arbitrary agent $i$ and type $t_i$ with common certainty in $h(F_z)$, and define $\mu_i := \mathrm{marg}_\Theta\, t_i$ to be his first-order belief. By construction, $\mu_i$ assigns probability 1 to $h(F_z)$, so it follows from (1.3) that
$$\int_U u_i(a^*_i, a^*_{-i})\, d\,g_*(\mu_i) \geq \int_U u_i(a_i, a^*_{-i})\, d\,g_*(\mu_i) \quad \forall\, a_i \neq a^*_i,$$
where $g_*(\mu)$ denotes the pushforward measure of $\mu$ under the mapping $g$. Repeating this argument for all agents and all types with common certainty in $h(F_z)$, it follows that $(s_i)_{i \in I}$ is indeed a Bayesian Nash equilibrium.
Now suppose to the contrary that $h(F_z) \not\subseteq U^{NE}_{a^*}$, and consider any payoff function $u$ that is in $h(F_z)$ but not in $U^{NE}_{a^*}$. Then, there exists some agent $i$ for whom
$$u_i(a^*_i, a^*_{-i}) - \max_{a_i \neq a^*_i} u_i(a_i, a^*_{-i}) < 0.$$
Let $t_i$ be the type with common certainty in $g^{-1}(u)$. Then, agent $i$ of type $t_i$ has a profitable deviation to some $a_i \neq a^*_i$, so $(s_i)_{i \in I}$ is not a Bayesian Nash equilibrium.
The final claim says that
$$P^n\left(\left\{z^n : h(F_{z^n}) \subseteq U^{NE}_{a^*}\right\}\right) \to 1 \quad \text{as } n \to \infty$$
if and only if $u^*$ is in the interior of the set $U^{NE}_{a^*}$. This is, loosely, because $h(F_{z^n})$ converges to the singleton set $\{u^*\}$; its proof is deferred to the appendix.

Claim 3 $\lim_{n \to \infty} P^n\left(\left\{z^n : h(F_{z^n}) \subseteq U^{NE}_{a^*}\right\}\right) = 1$ if and only if $u^* \in \mathrm{Int}(U^{NE}_{a^*})$.

The theorem directly follows from Lemmas 1 and 2.
1.5.3 Rationalizable Actions
When is the property of rationalizability of an action robust to inference? Theorem
1 suggests that the corresponding condition is strict rationalizability in the limit
complete information game. This intuition is roughly correct, but subtleties in the
procedure of elimination are relevant, and the theorem below will rely on two
different such procedures.
First, recall the usual definition of strict rationalizability, introduced in Dekel
et al. (2006). For every agent $i$ and type $t_i$, set $R^1_i[t_i] = A_i$. Then, recursively define
$R^k_i[t_i]$, for every $k \geq 2$, such that $a_i \in R^k_i[t_i]$ if and only if
$$\int_{\Theta \times T_{-i} \times A_{-i}} \big( u_i(a_i, a_{-i}, \theta) - u_i(a'_i, a_{-i}, \theta) \big)\, d\pi > 0 \quad \forall\, a'_i \neq a_i \qquad (1.4)$$
for some distribution $\pi \in \Delta(\Theta \times T_{-i} \times A_{-i})$ satisfying (1) $\mathrm{marg}_{\Theta \times T_{-i}}\, \pi = \kappa_{t_i}$,
and (2) $\pi\big( a_{-i} \in R^{k-1}_{-i}[t_{-i}] \big) = 1$. That is, an action survives the $k$-th round of
elimination only if it is a strict best response to some distribution over opponent
strategies surviving the $(k-1)$-th round of elimination. Let
$$R^\infty_i[t_i] = \bigcap_{k=0}^{\infty} R^k_i[t_i]$$
be the set of player $i$ actions that survive every round of elimination. Define $t_{\theta^*}$ to
be the type with common certainty in the true parameter $\theta^*$. I will say that action
$a_i$ is strongly strict-rationalizable if $a_i \in R^\infty_i[t_{\theta^*}]$, where strongly is used to contrast
with the definition below.
Notice that in this definition, every action that is never a strict best response (to
surviving opponent strategies) is eliminated at once. This choice has consequences
for the surviving set, since elimination of strategies that are never a strict best
response is an order-dependent process. Following, I introduce a new procedure,
in which actions are eliminated (at most) one at a time.
For every agent $i$, let $W^1_i := A_i$. Then, for $k \geq 2$, recursively remove (at most)
one action in $W^k_i$ that is not a strict best reply to any opponent strategy $\alpha_{-i}$ with
support in $W^{k-1}_{-i}$. That is, either the set difference $W^k_i \setminus W^{k+1}_i$ is empty, or it consists
of a single action $a_i$ for which there does not exist any $\alpha_{-i} \in \Delta\left(W^{k-1}_{-i}\right)$ such that
$$u_i(a_i, \alpha_{-i}) > u_i(a'_i, \alpha_{-i}) \quad \forall\, a'_i \neq a_i.$$
That is, $a_i$ is not a strict best reply to any distribution over surviving opponent
actions. Let
$$W^\infty_i = \bigcap_{k \geq 1} W^k_i$$
be the set of player $i$ actions that survive every round of elimination, and say that
any set $W^\infty_i$ constructed in this way survives an order of weak strict-rationalizability.
Define $\mathcal{W}^\infty_i$ to be the intersection of all sets $W^\infty_i$ surviving an order of weak strict-rationalizability. I will say that an action $a_i$ is weakly strict-rationalizable if $a_i \in \mathcal{W}^\infty_i$.26
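To make the elimination procedure concrete, the following is a minimal Python sketch of one order of weak strict-rationalizability in a two-player finite game. The encoding (a pair of payoff matrices) and the function names are illustrative choices of mine, not notation from this chapter; the strict-best-reply test is posed as a small linear program, and only a single elimination order is traced, so computing $\mathcal{W}^\infty_i$ would require intersecting the output over all orders.

```python
import numpy as np
from scipy.optimize import linprog

def is_strict_best_reply(U, a, rows, cols):
    """Is row action `a` a strict best reply, among `rows`, to some
    mixture over opponent actions in `cols`? (U = own payoff matrix.)"""
    others = [r for r in rows if r != a]
    if not others:
        return True  # no competing action left to beat
    m = len(cols)
    # Variables (alpha_1, ..., alpha_m, eps): maximize eps subject to
    # (U[a] - U[a']) . alpha >= eps for every competing row a'.
    c = np.zeros(m + 1); c[-1] = -1.0
    A_ub = np.array([np.append(-(U[a, cols] - U[r, cols]), 1.0) for r in others])
    b_ub = np.zeros(len(others))
    A_eq = np.array([[1.0] * m + [0.0]]); b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, 1)] * m + [(-1, 1)])
    return res.success and -res.fun > 1e-9  # strict best reply iff eps* > 0

def weak_strict_rationalizability(U1, U2):
    """Trace one order of removal, one never-strict-best-reply at a time."""
    surv = {1: list(range(U1.shape[0])), 2: list(range(U2.shape[1]))}
    while True:
        removed = False
        for U, own, opp in ((U1, surv[1], surv[2]), (U2.T, surv[2], surv[1])):
            for a in own:
                if not is_strict_best_reply(U, a, own, opp):
                    own.remove(a)  # remove (at most) one action per round
                    removed = True
                    break
            if removed:
                break
        if not removed:
            return surv
```

On the game $u_1$ discussed later in this section (player 1 payoffs $[[1,1],[0,0]]$, player 2 payoffs all zero), this particular order removes $a_2$ and then $a_3$, so $a_1$ survives.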
Theorem 2 Assume NI and R. Then, the rationalizability of action $a^*_i$ for agent $i$ is robust to inference if $a^*_i$ is strongly strict-rationalizable, and only if $a^*_i$ is weakly strict-rationalizable.

26The choice of weak to describe the latter procedure, and strong to describe the former, is explained by Claim 6 (see Appendix B), which says that an action is strongly strict-rationalizable only if it is weakly strict-rationalizable.
Remark 5 If there are two players, then the theorem above can be strengthened as follows: Assume NI and R. Then, the rationalizability of action $a^*_i$ for agent $i$ is robust to inference if and only if $a^*_i$ is weakly strict-rationalizable.
Remark 6 The existence of actions that are weakly strict-rationalizable, but not strongly strict-rationalizable, occurs only for a non-generic set of payoffs.27 See the discussion preceding Figure 1.5 for a characterization of these intermediate cases.
Remark 7 Rationalizable actions that are robust to inference need not exist. For example, in the degenerate game

             $a_3$        $a_4$
$a_1$        $0,\ 0$      $0,\ 0$
$a_2$        $0,\ 0$      $0,\ 0$

all actions are rationalizable, but none are robust to inference.
Remark 8 Why is refinement obtained, in light of the results of Weinstein and Yildiz (2007)? The key intuition is that the negative result in Weinstein and Yildiz (2007) relies on the construction of tail beliefs that put sufficient probability on payoff functions with dominant actions. But under common inference, it is common certainty that every player puts low probability on “most” payoff functions. So, with high probability, contagion from “far-off” payoff functions with a dominant action cannot begin.
A second explanation for why refinement is obtained is the following. One can show that the perturbations considered in this paper are a subset of perturbations in the uniform-weak topology, which is finer than the topology used in Weinstein and Yildiz (2007). In particular, the sequences of types used to show failure of robustness in Weinstein and Yildiz (2007) do not converge in the uniform-weak topology.

27The set of such payoffs is nowhere dense in the Euclidean topology on $U$.
The broad structure of the proof follows that of Theorem 1, with several new complications that I discuss below. Recall that as the quantity of data tends to infinity, agents have common certainty in a (shrinking) neighborhood of the true payoffs. Thus, $a^*_i$ is robust to inference if and only if common certainty in a sufficiently small neighborhood of the true payoffs $u^*$ implies that the action $a^*_i$ is rationalizable for player $i$.
A necessary condition for robustness to inference. In analogy with the set $U^{NE}_{a^*}$, define $U^R_{a^*_i}$ to be the set of all complete information games in which $a^*_i$ is rationalizable.28 Clearly, if $u^*$ is on the boundary of this set, then common certainty in a neighborhood of $u^*$ (no matter how small) cannot guarantee rationalizability of $a^*_i$. Therefore, a necessary condition for robustness to inference is that $u^*$ must lie in the interior of $U^R_{a^*_i}$. The first lemma says that the interior of $U^R_{a^*_i}$ is characterized by the set of actions that survive every process of weak strict-rationalizability.

Lemma 3 $u \in \mathrm{Int}\left(U^R_{a^*_i}\right)$ if and only if $a^*_i \in \mathcal{W}^\infty_i$ in the complete information game with payoffs $u$.
Why is $\mathrm{Int}\left(U^R_{a^*_i}\right)$ characterized by this particular notion of strict rationalizability, and not by others? I provide an example that illustrates why various other natural candidates are not the right notion, and follow this with a brief intuition for the proof of Lemma 3.

28Here I abuse notation and write $U^R_{a^*_i}$ instead of $U^R_{i, a^*_i}$.
Consider the payoff matrices below:

$(u_1)$            $a_3$        $a_4$
      $a_1$        $1,\ 0$      $1,\ 0$
      $a_2$        $0,\ 0$      $0,\ 0$

$(u_2)$            $a_3$        $a_4$
      $a_1$        $1,\ 0$      $1,\ 0$
      $a_2$        $0,\ 0$      $1,\ 0$

If all strategies that are never a strict best reply are eliminated simultaneously
(corresponding to strong strict-rationalizability), then $a_1$ does not survive in either
game.29 If the criterion is survival of any process of iterated elimination of strategies
that are never a strict best reply, then $a_1$ survives in both games.30 But $u_1$ is in the
interior of $U^R_{a_1}$, while $u_2$ is not,31 so neither of these notions provides the desired
differentiation.
Now, I provide a brief intuition for the “only-if” direction of Lemma 3. Suppose
that action $a^*_i$ fails to survive some iteration of weak strict-rationalizability. Then,
there is some sequence of sets $(W^k_i)_{k \geq 1}$ satisfying the recursive description in the

29In the first round, both actions are eliminated for player 2, so $a_1$ trivially cannot be a best reply for player 1 to any surviving player 2 action.

30For example, in both games, one may first eliminate $a_4$ for player 2 (it is never a strict best reply), and then eliminate $a_2$ for player 1; the action $a_1$ survives this order of elimination.

31Action $a_1$ remains rationalizable in every complete information game with payoffs close to $u_1$, so $u_1 \in \mathrm{Int}(U^R_{a_1})$. In contrast, for arbitrary $\varepsilon > 0$, the payoff matrix

$(u'_2)$            $a_3$        $a_4$
       $a_1$        $1,\ 0$      $1,\ \varepsilon$
       $a_2$        $0,\ 0$      $1+\varepsilon,\ \varepsilon$

is within $\varepsilon$ of $u_2$ (in the sup-norm), but $a_1$ is not rationalizable in the complete information game with payoffs $u'_2$. So, the payoff $u_2$ lies on the boundary of $U^R_{a_1}$.
definition of weak strict-rationalizability, such that $a^*_i \notin W^K_i$ for some $K < \infty$. To show
that $a^*_i$ is not robust to inference, I construct a sequence of payoffs $u^n \to u$ with the
property that $a^*_i$ fails to be rationalizable in every complete information game $u^n$,
for $n$ sufficiently large. The key feature of this construction is the translation of weak
dominance under the payoffs $u$ into strict dominance under the payoffs $u^n$. This is
achieved by iteratively increasing the payoffs to every action that survives to $W^{k+1}_i$
by $\varepsilon$, thus breaking ties in accordance with the selection in $(W^k_i)_{k \geq 1}$.

So, a necessary condition for robustness to inference is weak strict-rationalizability.
Next, I show that a sufficient condition is strong strict-rationalizability, and explain
the gap between these two conditions.
A sufficient condition for robustness to inference. The reason why weak strict-rationalizability is not sufficient is that, unlike in the analogous case for equilibrium, common certainty in the set $U^R_{a^*_i}$ does not imply rationalizability of $a^*_i$.32 In fact, even if beliefs are concentrated on a (vanishingly) small neighborhood of a payoff function in $\mathrm{Int}(U^R_{a^*_i})$, it may be that $a^*_i$ fails to be rationalizable. See Appendix D for such an example.
Remark 9 This example shows, moreover, that weak strict-rationalizability is not lower hemi-continuous in the uniform-weak topology. Since strong strict-rationalizability is lower-
32A simple example is the following. Consider the following two payoff matrices for agent 1:

$(u_1)$            $a_3$          $a_4$
      $a_1$        $1$            $0$
      $a_2$        $\tfrac34$     $\tfrac34$

$(u_2)$            $a_3$          $a_4$
      $a_1$        $0$            $1$
      $a_2$        $\tfrac34$     $\tfrac34$

Action $a_1$ is rationalizable for agent 1 in both complete information games, so $u_1, u_2 \in U^R_{a_1}$. But action $a_1$ is strictly dominated by action $a_2$ if each game is equally likely, since in expectation payoffs are

                   $a_3$          $a_4$
      $a_1$        $\tfrac12$     $\tfrac12$
      $a_2$        $\tfrac34$     $\tfrac34$
hemicontinuous in the uniform-weak topology (Dekel et al., 2006; Chen et al., 2010), this example suggests that subtleties in the definition of strict rationalizability have potentially large implications for robustness.
The reason why common certainty of a shrinking set in $U^R_{a^*_i}$ need not imply
rationalizability of $a^*_i$ is that the chain of best responses rationalizing action
$a^*_i$ can vary across $U^R_{a^*_i}$. In particular, it may be that the true payoffs $u^*$ lie on
the boundary between two open sets of payoff functions, each with different
families of rationalizable actions. See Figure 1.5 below for an illustration. These
cases are problematic because even though $a^*_i$ is rationalizable when agents (truly)
have common certainty in any payoff functions close to $u^*$, it may fail to be
rationalizable if agents (mistakenly) believe that payoff functions on different sides
of the boundary are common certainty.

On the other hand, if $a^*_i$ is strongly strict-rationalizable, then it can be justified
by a chain of strict best responses that remain constant on some neighborhood of
$u^*$. It can be shown in this case that common certainty in a vanishing neighborhood
of $u^*$ indeed implies rationalizability of $a^*_i$. This provides the sufficient direction.
Figure 1.5: The set $U^R_{a^*_i}$ is partitioned such that every agent's set of rationalizable actions is constant across each element of the partition. There are three cases: (1) if $u^*$ is on the boundary of $U^R_{a^*_i}$ (e.g. $u_1$), then $a^*_i$ is not robust to inference; (2) if $u^*$ is in the interior of $U^R_{a^*_i}$, and moreover in the interior of a partition element ($u_2$), then $a^*_i$ is certainly robust to inference; (3) if $u^*$ is in the interior of $U^R_{a^*_i}$, but not in the interior of any partition element ($u_3$), then $a^*_i$ may not be robust to inference. See Appendix D for an example of the last case.
1.6 How Much Data do Agents Need?
Theorems 1 and 2 characterize the persistence of equilibria and rationalizable
actions given sufficiently large quantities of data. But in practice, the quantity of
data that agents observe about payoff-relevant parameters is limited. Robustness
to inference is meaningful only if convergence obtains in the ranges of data that
we can reasonably expect agents to observe. Therefore, I ask next, how much data
is needed for reasonable guarantees on persistence?
This section addresses this question by providing lower bounds for $p^R_n(i, a_i)$ and
$p^{NE}_n(a)$. These bounds suggest a second, stronger criterion for equilibrium selection,
based on the quantity of data needed to reach a desired threshold probability. The
bounds also highlight the importance of various features of the solution and the
game, including the degree of strictness of the solution and the complexity of the
inference problem.
1.6.1 Bayesian Nash Equilibrium
The following is a measure of the “degree of strictness” of a Nash equilibrium in
the complete information game with payoffs $u^* = g(\theta^*)$. For any $\delta \geq 0$, say that
action profile $a$ is a $\delta$-strict Nash equilibrium33 if
$$u^*_i(a_i, a_{-i}) - \max_{a'_i \neq a_i} u^*_i(a'_i, a_{-i}) > \delta \quad \forall\, i \in I.$$

33Replacing the strict inequality $>$ with a weak inequality $\geq$, this definition reverses the more familiar concept of $\epsilon$-equilibrium, which requires that
$$u^*_i(a_i, a_{-i}) - \max_{a'_i \neq a_i} u^*_i(a'_i, a_{-i}) \geq -\epsilon \quad \forall\, i, \text{ where } \epsilon \geq 0.$$
The concept of $\epsilon$-equilibrium was introduced to formalize a notion of approximate Nash equilibrium (violating the equilibrium conditions by no more than $\epsilon$). I use $\delta$-strict equilibrium to provide a cardinal measure for the strictness of a Nash equilibrium (satisfying the conditions with $\delta$ to spare).
Every strict Nash equilibrium $a^*$ admits the following cardinal measure of strictness:
$$\delta^{NE}_a = \sup\{\delta : a \text{ is a } \delta\text{-strict NE}\},$$
which represents the largest $\delta$ for which $a$ is a $\delta$-strict NE. This parameter describes the amount of slack in the equilibrium $a$: the action profile $a$ remains an equilibrium on at least a $\delta^{NE}_a$-neighborhood of the payoff function $u^*$.
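For finite two-player games this slack is directly computable. The following minimal sketch (function name and game encoding are mine, for illustration) returns the equilibrium margin of a profile, which equals $\delta^{NE}_a$ when positive:

```python
import numpy as np

def delta_strictness(U1, U2, a):
    """Margin of profile a = (row, col): the minimum over players of
    u_i(a) minus the best unilateral deviation; negative if not strict NE."""
    r, c = a
    margin1 = U1[r, c] - np.max(np.delete(U1[:, c], r))  # player 1 deviations
    margin2 = U2[r, c] - np.max(np.delete(U2[r, :], c))  # player 2 deviations
    return min(margin1, margin2)

# The 2x2 game that opens Section 1.5, evaluated at theta* = 1:
U1 = np.array([[1.0, 0.0], [0.0, 0.5]])
U2 = np.array([[1.0, 0.0], [0.0, 0.5]])
print(delta_strictness(U1, U2, (0, 0)))  # 1.0, so delta_NE of (a1, a1) is 1
```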
Proposition 2 Suppose $a^*$ is a $\delta$-strict Nash equilibrium for some $\delta \geq 0$. Then, for every $n \geq 1$,
$$p^{NE}_n(a^*) \;\geq\; 1 - \frac{2}{\delta^{NE}_{a^*}}\, \mathbb{E}_{P^n}\!\left( \sup_{\mu \in M} \left\| h(\mu_{Z^n}) - u^* \right\|_\infty \right) \qquad (1.5)$$
where $h(\nu) = \int_\Theta g(\theta)\, d\nu$ for every $\nu \in \Delta(\Theta)$.

Remark 10 Uniform consistency of $M$ implies that $\sup_{\mu \in M} \|h(\mu_{Z^n}) - u^*\|_\infty \to 0$ a.s., so for any strict Nash equilibrium $a^*$, the bound in Proposition 2 converges to 1.34 This implies also that the gap between $p^{NE}_n(a^*)$ and its lower bound in (1.5) converges to 0 as the quantity of data $n$ tends to infinity.
How can we interpret this bound? By assumption, action profile $a^*$ is an
equilibrium in the complete information game with payoffs $u^*$. But when $n < \infty$,
agents may have heterogeneous and incorrect beliefs. The probability with which
$a^*$ persists as an equilibrium under these modified beliefs is determined by two
components:
$$1 - \underbrace{\frac{2}{\delta^{NE}_{a^*}}}_{(1)}\; \underbrace{\mathbb{E}_{P^n}\!\left( \sup_{\mu \in M} \left\| h(\mu_{Z^n}) - u^* \right\|_\infty \right)}_{(2)}.$$
First, it depends on the fragility of the solution $a^*$ to the introduction of heterogeneity
and error in beliefs. This is reflected in component (1): the bound is increasing

34This follows from continuity of the map $h$ (see Lemma 4).
in the parameter $\delta^{NE}_{a^*}$. Intuitively, equilibria that are “stricter” persist on a larger neighborhood of the true payoffs $u^*$. It turns out that common certainty in the $\delta^{NE}_{a^*}/2$-neighborhood of $u^*$ is sufficient to imply that $a^*$ is an equilibrium (see Lemma 11).
Second, the probability $p^{NE}_n(a^*)$ depends on the expected error in beliefs. This is
reflected in the second component: $\|h(\mu_{Z^n}) - u^*\|_\infty$ is the (random) error in estimated
payoffs using a fixed inference rule $\mu \in M$; $\sup_{\mu \in M} \|h(\mu_{Z^n}) - u^*\|_\infty$ is the (random)
supremum error across inference rules in $M$; and (2) gives the expected supremum
error across inference rules in $M$. As $n$ tends to infinity, this quantity tends to
zero,35 but the speed at which inference rules in $M$ uniformly converge to the truth
is determined by the “diversity” of inference rules in $M$, and by the statistical
complexity of the learning problem.
The first feature, diversity, can be thought of as a property of the relationship
of the inference rules in $M$ to each other. Holding fixed the rate at which
individual inference rules learn, the lower bound is lower when the inference rules in
$M$ jointly learn more slowly. How this occurs, and how much effect it can have on the
analyst's confidence $p^{NE}_n(a^*)$, is discussed in detail in Section 6.3.

The second feature, complexity, can be thought of as a property of the relationship
between the inference rules in $M$ and the data. For example, in Section 2,
the probability $p(n, r)$ decreases in the dimensionality of the data. More generally,
when finite-sample bounds for the uniform rate of convergence of inference rules
in $M$ are available, they can be plugged into the lower bound in Proposition 2. This
technique is illustrated below for a new set of inference rules $M$.
Example 6 Let us consider agents who use ordinary least-squares regression to estimate the relationship between $p$ covariates and a real-valued outcome variable. An observation is a tuple $(x_i, y_i) \in Z := \mathbb{R}^p \times \mathbb{R}$, where
$$y_i = x_i^T \beta + \epsilon_i$$
with $x_i \sim_{i.i.d.} \mathcal{N}(0, I_p)$, $\epsilon_i \sim_{i.i.d.} \mathcal{N}(0, 1)$, and $x_i$ and $\epsilon_i$ independent. Suppose that the first coordinate of the coefficient vector $\beta$, denoted $\beta_1$, is payoff-relevant. That is, $\Theta = \mathbb{R}$, and the true parameter is $\theta^* = \beta_1$.

Recall that the least-squares estimate for the coefficient vector $\beta$ is
$$\hat{\beta} = (X^T X)^{-1} X^T Y$$
where $X$ is the matrix whose $i$-th row is given by $x_i$, and $Y$ is the vector whose $i$-th entry is given by $y_i$. Fix a sequence of constants $\phi_n$ that tends to 0. Let $M$ consist of the set of inference rules that map the data into a distribution with support in $B_{\phi_n}(\hat{\beta}_1)$. That is, every inference rule maps the data into a distribution with support in the $\phi_n$-neighborhood of the least-squares estimate for $\beta_1$.

35Since $M$ is $\theta^*$-uniformly consistent and $h$ is continuous.
Corollary 1 Suppose the data-generating process and family of inference rules are as described in the above example. Then, for every complete information game and $\delta$-strict Nash equilibrium $a^*$ (with $\delta \geq 0$),
$$p^{NE}_n(a^*) \;\geq\; 1 - \frac{2K}{\delta^{NE}_{a^*}}\left( \sigma^2\left( \sqrt{\frac{p}{n}} + \frac{p}{n} \right) + \phi_n^2 \right)$$
where $K$ is the Lipschitz constant36 of the map $g : \Theta \to U$.

Thus, the lower bound is decreasing in the number of covariates $p$ (i.e., the analyst is less confident in predicting $a^*$ when the number of covariates is larger). The proof can be found in the appendix.
36Assuming the sup-norm on $U$ and the Euclidean norm on $\Theta$.
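As a sanity check on bounds of this kind, $p^{NE}_n(a^*)$ can also be estimated directly by simulation. The sketch below is my own illustration, not the paper's code: it draws OLS datasets as in Example 6, and, for the $2 \times 2$ game that opens Section 1.5, counts how often every plausible belief in the $\phi_n$-neighborhood of $\hat{\beta}_1$ keeps $(a_1, a_1)$ an equilibrium (i.e., keeps the estimated $\theta$ above zero).

```python
import numpy as np

def estimate_pNE(theta_star, n, p, phi_n, trials=2000, seed=0):
    """Monte Carlo estimate of p_n^NE((a1, a1)) for the Example 6 setup."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    beta[0] = theta_star                      # theta* is the first coefficient
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, p))
        y = X @ beta + rng.standard_normal(n)
        b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
        if b_hat[0] - phi_n > 0:              # worst plausible belief still > 0
            hits += 1
    return hits / trials

# Confidence rises with n, and falls with the number of covariates p:
for n in (25, 100, 400):
    print(n, estimate_pNE(theta_star=0.5, n=n, p=10, phi_n=n ** -0.5))
```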
1.6.2 Rationalizable Actions
We can now repeat the previous exercise for the solution concept of rationalizability.
The following is a measure of the “degree” of rationalizability of an action $a_i$ in the complete information game with payoffs $u^*$. For any $\delta \geq 0$, say that the family of sets $(R_j)_{j \in I}$ is closed under $\delta$-best reply if for every agent $j$ and action $a_j \in R_j$, there is some distribution $\alpha_{-j} \in \Delta(R_{-j})$ such that
$$u^*_j(a_j, \alpha_{-j}) > u^*_j(a'_j, \alpha_{-j}) + \delta \quad \forall\, a'_j \neq a_j. \qquad (1.6)$$
Say that action $a_i$ is $\delta$-strict rationalizable for agent $i$ if there exists some family $(R_j)_{j \in I}$, with $a_i \in R_i$, that is closed under $\delta$-best reply. Every strictly rationalizable action $a_i$ admits the following cardinal measure of the degree of strictness:37
$$\delta^R_{a_i} = \sup\{\delta : a_i \text{ is } \delta\text{-strict rationalizable}\}.$$
This parameter describes the amount of slack in the rationalizability of action $a_i$; that is, action $a_i$ remains rationalizable for agent $i$ on at least a $\delta^R_{a_i}$-neighborhood of the payoff function $u^*$.
Remark 11 This definition is equivalent to requiring that $a_i$ survive a more general version of strong strict-rationalizability, where the inequality in (1.4) is replaced by
$$\int_{\Theta \times T_{-i} \times A_{-i}} \big( u_i(a_i, a_{-i}, \theta) - u_i(a'_i, a_{-i}, \theta) \big)\, d\pi > \delta \quad \forall\, a'_i \neq a_i,$$
so that $a_i$ yields at least $\delta$ more than the next best action given the distribution $\pi$.38

37I abuse notation here and write $\delta^R_{a_i}$ instead of $\delta^R_{i, a_i}$. Again, this parameter is defined only if $a_i$ is $\delta$-strict rationalizable for some $\delta \geq 0$.

38A similar procedure is introduced in Dekel et al. (2006). The above definition makes the following modifications: first, the inequality is strict; second, $\delta$ appears on the right-hand side of the inequality, instead of $-\delta$.
Proposition 3 Suppose action $a^*_i$ is $\delta$-strict rationalizable for some $\delta > 0$. Then, for every $n \geq 1$,
$$p^R_n(i, a^*_i) \;\geq\; 1 - \frac{2}{\delta^R_{a^*_i}}\, \mathbb{E}_{P^n}\!\left( \sup_{\mu \in M} \left\| h(\mu_{Z^n}) - u^* \right\|_\infty \right),$$
where $h(\nu) = \int_\Theta g(\theta)\, d\nu$ for any $\nu \in \Delta(\Theta)$.

Proof 6 See appendix.
Again, we see that the lower bound is increasing in the “strictness” of the solution, as measured through the parameter $\delta^R_{a^*_i}$, and in the speed at which expected payoffs under inference rules in $M$ uniformly converge to the true payoffs $u^*$. As before, when finite-sample bounds are available, they can be used to derive closed-form expressions for this bound.
Corollary 2 Suppose the data-generating process and family of inference rules are as described in Example 6. Then, for every complete information game, agent $i$, and $\delta$-strict rationalizable action $a^*_i$ (with $\delta \geq 0$),
$$p^R_n(i, a^*_i) \;\geq\; 1 - \frac{2K}{\delta^R_{a^*_i}}\left( \sigma^2\left( \sqrt{\frac{p}{n}} + \frac{p}{n} \right) + \phi_n^2 \right)$$
where $K$ is the Lipschitz constant39 of the map $g : \Theta \to U$.
1.6.3 Diversity across Inference Rules in M
I conclude this section with a brief discussion of the dependence of $p^{NE}_n(a)$
and $p^R_n(i, a_i)$ on the diversity across inference rules in $M$. To isolate this effect from
properties of individual inference rules, let us fix the marginal distributions of
$\mu(Z^n)$ for every $\mu \in M$, and vary the joint distribution of the random variables

39Assuming the sup-norm on $U$ and the Euclidean norm on $\Theta$.
$(\mu(Z^n))_{\mu \in M}$. Proposition 4 below provides upper and lower bounds for $p^{NE}_n(a)$ and $p^R_n(i, a_i)$. These bounds can be understood from the following simple example.
Example 7 Recall the game from Section 2 with payoffs

             $A$                    $NA$
$A$          $\theta,\ \theta$      $0,\ \tfrac12$
$NA$         $\tfrac12,\ 0$         $\tfrac12,\ \tfrac12$

where $\Theta = \{-1, 1\}$. Fix a quantity of data $n < \infty$, and suppose that $M$ consists of two inference rules $\mu_1, \mu_2$ with marginal distributions
$$\mu_1(Z^n) \sim \tfrac14 \delta_{-1} + \tfrac34 \delta_1, \qquad \mu_2(Z^n) \sim \tfrac34 \delta_{-1} + \tfrac14 \delta_1.$$
That is, with probability $\tfrac14$, data $z^n$ is generated such that $\mu_1(z^n)$ is degenerate on $-1$, and with probability $\tfrac34$, data is generated such that $\mu_1(z^n)$ is degenerate on 1. (The distribution of $\mu_2(Z^n)$ is interpreted similarly.) Given these distributions, what are the largest and smallest possible values of $p^{NE}_n((A, A))$?

First observe that action profile $(A, A)$ is an equilibrium if and only if data $z^n$ is realized such that $\mu_1(z^n) = \mu_2(z^n) = \delta_1$. Otherwise, $A$ is strictly dominated for the agent with first-order belief $\delta_{-1}$. At one extreme, $\mu_1(Z^n)$ and $\mu_2(Z^n)$ may be correlated such that $\mu_1(z^n) = \delta_1$ for every dataset $z^n$ where $\mu_2(z^n) = \delta_1$. Then,
$$p^{NE}_n(a) = \Pr(\{z^n : \mu_2(z^n) = \delta_1\}) = \tfrac14.$$
If instead $\mu_1$ and $\mu_2$ are independent, then
$$p^{NE}_n(a) = \Pr(\{z^n : \mu_1(z^n) = \delta_1\})\, \Pr(\{z^n : \mu_2(z^n) = \delta_1\}) = \left(\tfrac34\right)\left(\tfrac14\right) < \tfrac14.$$
This quantity is further reduced if $\mu_2(z^n) = \delta_1$ implies that $\mu_1(z^n) = \delta_{-1}$, in which case $p^{NE}_n(a) = 0$.
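A small simulation makes the role of the coupling explicit. In the sketch below (construction and names are mine), the two inference rules are coupled through a common uniform draw; the co-monotonic, independent, and counter-monotonic couplings reproduce the values $\tfrac14$, $\tfrac{3}{16}$, and $0$ computed above.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 200_000

def p_ne(coupling):
    """Fraction of simulated datasets on which both rules output delta_1,
    i.e., on which (A, A) survives as an equilibrium."""
    u = rng.uniform(size=trials)
    mu1 = u < 0.75                           # Pr(mu1 = delta_1) = 3/4
    if coupling == "comonotone":
        mu2 = u < 0.25                       # mu2 = delta_1 only when mu1 is
    elif coupling == "independent":
        mu2 = rng.uniform(size=trials) < 0.25
    else:                                    # counter-monotone
        mu2 = u > 0.75                       # mu2 = delta_1 only when mu1 is not
    return np.mean(mu1 & mu2)

for c in ("comonotone", "independent", "countermonotone"):
    print(c, round(float(p_ne(c)), 3))       # ~0.25, ~0.188, 0.0
```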
These observations can be generalized as follows for arbitrary finite $M$. For every inference rule $\mu$ and quantity of data $n \geq 1$, define
$$p^{NE}_{\mu,n}(a) := \Pr\left( h(\mu_{Z^n}) \in U^{NE}_a \right). \qquad (1.7)$$
This is the probability that action profile $a$ is a Nash equilibrium if every agent has beliefs degenerate on the prediction of inference rule $\mu$. Define $p^R_{\mu,n}(i, a_i)$ analogously, replacing $U^{NE}_a$ with $U^R_{a_i}$ in (1.7).
Proposition 4 Suppose $M$ is finite, and the marginal distributions $(\mu(Z^n))_{\mu \in M}$ are fixed. Then,
$$1 - \sum_{\mu \in M}\left(1 - p^{NE}_{\mu,n}(a)\right) \;\leq\; p^{NE}_n(a) \;\leq\; \min_{\mu \in M}\, p^{NE}_{\mu,n}(a)$$
and
$$1 - \sum_{\mu \in M}\left(1 - p^R_{\mu,n}(i, a_i)\right) \;\leq\; p^R_n(i, a_i) \;\leq\; \min_{\mu \in M}\, p^R_{\mu,n}(i, a_i).$$
The upper bound corresponds to co-monotonic random variables, and the lower
bound, when attainable, corresponds to counter-monotonic random variables. In
the co-monotonic case, different inference rules err in inference of payoffs on the
same sets of data, whereas in the counter-monotonic case they err on datasets that
are as non-overlapping as possible.
1.7 Extensions
The following section provides brief comments on, and extensions to, various modeling choices made in the main framework.
1.7.1 Misspecification
Proposition 1 shows that $\theta^*$-uniform consistency is both necessary and sufficient
for common learning, and I assume in the remainder of the paper that the family
of inference rules $M$ is $\theta^*$-uniformly consistent. But continuity in equilibrium
sets (and rationalizable sets) does not require common learning. Can we obtain
Theorems 1 and 2 under a weakening of this property?

In fact, it is neither necessary that individual inference rules are consistent,
nor necessary that inference rules converge uniformly. I introduce a relaxation of
uniform consistency below.
Definition 5 (Almost $\theta^*$-uniform consistency.) For any $\epsilon \geq 0$, say that the class of inference rules $M$ is $(\epsilon, \theta^*)$-uniformly consistent if
$$\lim_{n \to \infty}\, \sup_{\mu \in M}\, d_P(\mu(Z^n), \delta_{\theta^*}) \leq \epsilon \quad \text{a.s.}$$
where $d_P$ is the Prokhorov metric on $\Delta(\Theta)$.

This says that a class of inference rules is almost $\theta^*$-uniformly consistent if the set of plausible first-order beliefs converges40 almost surely to a neighborhood of the true parameter. Notice that uniform consistency is nested as the $\epsilon = 0$ case. The proofs of Theorems 1 and 2 are easily adapted to show the following result. (In reading this, recall that if $M$ is $(\epsilon, \theta^*)$-uniformly consistent, then it is also $(\epsilon', \theta^*)$-uniformly consistent for every $\epsilon' > \epsilon$.)
Proposition 5 Assume NI and R.

1. The rationalizability of action $a^*_i$ is robust to inference if $\delta^R_{a^*_i} > 0$ and $M$ is $\left(\delta^R_{a^*_i}, \theta^*\right)$-uniformly consistent.

2. The equilibrium property of $a^*$ is robust to inference if $\delta^{NE}_{a^*} > 0$ and $M$ is $\left(\delta^{NE}_{a^*}, \theta^*\right)$-uniformly consistent.

40In the Hausdorff distance induced by $d_P$.
1.7.2 Private Data
In the main text, I assume that agents observe a common dataset. How do the main
results change if agents observe private data? Cripps et al. (2008) have shown that
if Z is unrestricted, then common learning may not occur even if |M| = 1 (so that
M contains a single inference rule). It is also known that strict Nash equilibria need
not be robust to higher-order uncertainty about opponent data (see e.g. Carlsson
and van Damme (1993), Kajii and Morris (1997)). Thus, extension to private data
requires restrictions on beliefs over opponent data that are beyond the scope of this
paper.
In the simplest extension, however, we may suppose that players observe
different datasets (zi)i2I , independently drawn from the same distribution, but
each has an (incorrect) degenerate belief that all opponents have seen the same data
that he has. Then, Theorems 1 and 2 hold as stated, and the bounds in Propositions
2 and 3 are revised as follows.
Proposition 6 Suppose $a^*$ is a $\delta$-strict Nash equilibrium for some $\delta > 0$. Then, for every $n \geq 1$,
$$p^{NE}_n(a^*) \;\geq\; \left( 1 - \frac{2}{\delta^{NE}_{a^*}}\, \mathbb{E}_{P^n}\!\left( \sup_{\mu \in M} \left\| h(\mu_{Z^n}) - u^* \right\|_\infty \right) \right)^{I}$$
where $I$ is the number of players. Suppose $a^*_i$ is $\delta$-strict rationalizable for some $\delta$. Then, for every $n \geq 1$,
$$p^R_n(i, a^*_i) \;\geq\; \left( 1 - \frac{2}{\delta^R_{a^*_i}}\, \mathbb{E}_{P^n}\!\left( \sup_{\mu \in M} \left\| h(\mu_{Z^n}) - u^* \right\|_\infty \right) \right)^{I}.$$
1.7.3 Limit Uncertainty
In the main text, I assume that agents learn the true parameter as the quantity of data $n$ tends to infinity, so that the limit game is a complete information game. This approach can be extended so that the limit game has incomplete information. Fix a distribution $\nu \in \Delta(\Theta)$—a limit common prior—and rewrite uniform consistency as follows:

Definition 6 (Limit Common Prior.) The set of inference rules $M$ has a limit common prior $\nu$ if
$$\sup_{\mu \in M}\, d_P(\mu(Z^n), \nu) \to 0 \quad \text{a.s.}$$
where $d_P$ is the Prokhorov metric on $\Delta(\Theta)$.

Then, taking $u^* := h(\nu)$ to be the expected payoff under $\nu$, all the results in Section 5 follow without modification.
1.8 Related Literature
This paper makes a connection between the literature regarding robustness of
equilibrium to specification of agent beliefs, and the literature that studies agents
who learn from data. I discuss each of these literatures in turn.
1.8.1 Robustness of equilibrium and equilibrium refinements
The following question has been the focus of an extensive literature: Suppose an analyst does not know the exact game that is being played. Which solutions in his model of the game can be guaranteed to be close to some solution in all nearby games?
Early work on this question considered “nearby" to mean complete information
games with close payoffs (Selten, 1975; Myerson, 1978; Kohlberg and Mertens, 1986).
Fudenberg et al. (1988) proposed consideration of nearby games in which players
themselves have uncertainty about the true game. This approach of embedding
a complete information game into games with incomplete information has since
been taken in several papers under different assumptions on beliefs. For example:
Carlsson and van Damme (1993) consider a class of incomplete information games
in which beliefs are generated by (correlated) observations of a noisy signal of
payoffs of the game. Kajii and Morris (1997) study incomplete information games in
which beliefs are induced by general information structures that place sufficiently
high ex-ante probability on the true payoffs.
I ask which solutions of a complete information game persist in nearby incom-
plete information games, where the definition of nearby that I use differs from
the existing literature in the following ways: First, I place a strong restriction on
(interim) higher-order beliefs, which has the consequence that agents commonly
learn the true parameter. This contrasts with Carlsson and van Damme (1993)
and Kajii and Morris (1997), in which—even as perturbations become vanishingly
small—agents consider it possible that other agents have beliefs about the un-
known parameter that are very different from their own. In particular, failures of
robustness due to standard contagion arguments do not apply in my setting; thus,
I obtain rather different robustness results.41
Second, while the restriction I place on interim beliefs is stronger in the sense
41For example, the construction of beliefs used in Weinstein and Yildiz (2007) to show failure of robustness (Proposition 2) relies on the construction of tail beliefs that place positive probability on an opponent having a first-order belief that implies a dominant action. A similar device is employed in Kajii and Morris (1997) to show that robust equilibria need not exist (see the negative example in Section 3.1). These tail beliefs are not permitted under my approach. When the quantity of data is taken to be sufficiently large, it is common certainty (with high probability) that all players have first-order beliefs close to the true distribution, so the process of contagion cannot begin.
described above, I do not require that these beliefs are consistent with a common
prior. This allows for common knowledge disagreement, which is not permitted in
either Carlsson and van Damme (1993) or Kajii and Morris (1997).
Finally, the class of perturbations that I consider are motivated by a learning
foundation (this aspect shares features with Dekel et al. (2004) and Esponda (2013),
but agents in this paper learn about payoffs only, and not actions). I interpret the
sequence of interim types as corresponding to learning from a fixed number of
observations. This motivates a departure from the literature in studying solution
sets not just in nearby games (large n), but also in far games (small n). In particular,
I suggest that we can characterize the degree of robustness by looking at the
persistence of solutions in small-n games.
1.8.2 Role of higher-order beliefs
A related literature studies the sensitivity of solutions to specification of higher-
order beliefs. Early papers in this literature (Mertens and Zamir, 1985; Branden-
burger and Dekel, 1993) considered types to be nearby if their beliefs were close up
to order k for large k (corresponding to the product topology on types). Several au-
thors have shown that this notion of close leads to surprising and counterintuitive
conclusions, in particular that strict equilibria and strictly rationalizable actions are
fragile to perturbations in beliefs (Rubinstein, 1989; Weinstein and Yildiz, 2007).
These findings have motivated new definitions of “nearby" types. Dekel et al.
(2006) characterize the coarsest metric topology on types under which the desired
continuity properties hold. This topology is defined via strategic properties of
types, instead of directly on beliefs. Chen et al. (2010) subsequently developed a
(finer) metric topology on types—the uniform-weak topology—which is defined
explicitly using properties of beliefs. In this topology, two types are considered
close if they have similar first-order beliefs, attach similar probabilities to other
players having similar first-order beliefs, and so forth.
The perturbations in beliefs that I allow for are perturbations in the uniform-
weak topology. Specifically, the type spaces that I look at—that is, all type profiles
with common certainty in the predictions of a set of inference rules M—converge in
this topology to the singleton type space containing the type with common certainty
in the true parameter. Thus, robustness to inference can be interpreted as requiring
persistence across a subset of perturbations in the uniform-weak topology.42 A
related study is taken in Morris et al. (2012) and Morris and Takahashi, where
approximate common certainty in the true parameter is considered, instead of
common certainty in a neighborhood of the true parameter.
1.8.3 Agents who learn from data
The set of papers including Gilboa and Schmeidler (2003), Billot et al. (2005), Gilboa
et al. (2006), Gayer et al. (2007), and Gilboa et al. (2013) propose an inductive or
case-based approach to modeling economic decision-making. The present paper
can be interpreted as studying the strategic behaviors of case-based learners when
there is uncertainty over the inductive inference rules used by other agents.
There is also a body of work that studies asymptotic disagreement between
agents who learn from data. Cripps et al. (2008) study agents who use the same
Bayesian inference rule but observe different (private) sequences of data; Al-Najjar
(2009) studies agents who use different frequentist rules to learn from data; and
Acemoglu et al. (2015) study Bayesian agents who have different priors over the
signal-generating distribution. My model of belief formation shares many features
42The characterizations of robustness in this paper are possibly unchanged if agents have common $p$-belief in the predictions of inference rules in $M$, where $p \to 1$ as the quantity of data tends to infinity. I leave verification of this for future work.
with these models, but the main object of study here is the convergence of equilibrium sets, instead of the convergence of beliefs.
Finally, Steiner and Stewart (2008) study the limiting equilibria of a sequence of
games in which agents use a kernel density estimator to infer payoffs from related
games. This paper is conceptually very close, but there are several important
differences in the approach. For example, Steiner and Stewart (2008) suppose that
agents share a common inference rule and observe endogenous data (generated by
past, strategic actors), while I suppose that agents have different inference rules
and observe exogenous data. Additionally, the (common) inference rule in Steiner
and Stewart (2008) is not indexed by the quantity of data, so the limit of their
learning process is a game with heterogeneous beliefs, whereas the limit of my
process is a game with common certainty of the true distribution.
1.8.4 Model uncertainty
Consideration of model uncertainty in game theory is largely new, but similar
ideas have been advanced in several neighboring areas of economics. Eyster and
Piccione (2013) study an asset-pricing model in which agents have different incomplete theories of price formation. The set of papers including Hansen and Sargent (2007), Hansen and Sargent (2010), Hansen and Sargent (2012), and Hansen (2014), among others, consider the implications of model uncertainty for various questions in macroeconomics. In their framework, a decision-maker considers a set of models (prior distributions) plausible, and uses a max-min criterion for decision-making.
1.8.5 Epistemic game theory
I extensively use tools, results, and concepts from various papers in epistemic game theory, including Monderer and Samet (1989), Brandenburger and Dekel (1993), Morris et al. (1995), Dekel et al. (2007), and Chen et al. (2010). The notion of common certainty in a set of first-order beliefs was studied earlier in Battigalli and Siniscalchi (2003).
1.9 Discussion
Directions for future work include the following:
Endogenous data. In this paper, data is generated according to an exogenous
distribution. An important next step is to consider data generated by actions
played by past strategic actors. In this dynamic setting, past actions play a role in
coordinating future beliefs via the kind and quantity of data generated.
Optimal informational complexity design. Suppose a designer has control over
the complexity of information disclosed to agents in a strategic setting. Using the
approach developed in this paper, the designer’s choice of complexity influences
the commonality in beliefs across agents. When will he choose to disclose simpler
information, and when will he disclose information that is more complex? If the
designer’s interests are opposed to those of the agents, should a social planner
regulate the kind of information he can provide?
Confidence in predictions. An action profile is usually thought of as having the binary quality of either being, or not being, a solution. The approach in this paper may provide a way to qualify such statements with a level of confidence: here, $p_n(a)$ describes the analyst's confidence in predicting $a$ given $n$ observations. I hope to extend these ideas towards the construction of a cardinal measure for the strength of equilibrium predictions across different games.
Chapter 2
The Theory is Predictive, but is it
Complete? An Application to
Human Perception of
Randomness1
2.1 Introduction
When we test theories, it is common to focus on what one might call correctness:
do the predictions of the theory match what we see in the data? For example, we
can test a theory that says that wages are determined by one’s knowledge and
capabilities, by looking at whether more education indeed predicts higher wages
in labor data. Such a finding would provide evidence in support of the theory, but
little guidance towards whether an alternative theory might fit the data even better.
1Co-authored with Jon Kleinberg and Sendhil Mullainathan
Beyond correctness, we also care about this latter feature, which we will refer to as
completeness: how much of the explainable variation in the data is captured by the
theory?
Measurement of the completeness of our theories is important because it
provides guidance on the marginal (predictive) gain of improvement in modeling.
If our models are fairly complete, then new theories should not be expected
to drastically improve predictive power (they may, of course, aid towards other
goals—for example, by providing new conceptual insight into the problem). In
contrast, if our models are far from complete, then new models have the potential
to lead to large improvements in prediction. Completeness therefore guides both
our understanding of the achievements already made by existing models, and also
of the potential progress that remains ahead.
Despite an interest in completeness, we focus on correctness in social science
for a pragmatic reason. We can measure the fit of any given theory to data, but
we have no intuition for what constitutes a “good" fit. For example, suppose
we are interested in predicting a binary variable and find that a given theory
predicts accurately in 55% of observed trials. Is this achievement significant? For
certain problems—e.g. prediction of changes in stock returns given a past history
of returns—55% accuracy is a stunning success. In other problems—e.g. prediction
of college matriculation given socioeconomic and other personal characteristics—it
is only mediocre. This significant variation in predictability across problems means
that perfect accuracy is not a universally appropriate benchmark for our theories:
we need to understand how well a theory’s predictive power lines up against some
best achievable accuracy.
The purpose of this paper is to propose a practically implementable way to gen-
erate this benchmark, via methods in machine learning. As we discuss more fully
in Section 5, this is an approach that is also being proposed contemporaneously
with our work by Peysakhovich and Naecker (2016). Recent advances in machine
learning have enabled substantial progress in problems of prediction, but are often
criticized for using atheoretical and uninterpretable models, for example by searching
for the best prediction function over a large set of explanatory variables. The
resulting prediction functions perform well empirically but rarely reveal a deep
theoretical structure.
Rather than considering machine learning as a replacement for existing theories,
a role for which it (currently) seems ill-suited, our goal is to leverage its techniques
towards the alternative goal of assessing theory completeness. The approach
we propose is simple: compare the performance of existing (interpretable and
economically meaningful) models to the performance of atheoretical machine
learning algorithms.
We illustrate this approach on a simple problem with a long history of study in
psychology and behavioral economics: human generation of random sequences. It
is well documented that humans misperceive randomness (Bar-Hillel & Wagenaar,
1991; Kahneman & Tversky 1972), with implications in many economic settings
(Rabin and Vayanos, 2010; Chen et al.). Leading models for human misperception of
randomness include Rabin (2002) and Rabin and Vayanos (2010). We are interested
in assessing the success of these models towards predicting human generation of
fair coin flips.
To this end, we use the platform Mechanical Turk to collect 14,050 strings of
length eight, produced as if by flipping a fair coin several times in succession. We
ask two questions. First: can we predict the eighth flip in a string if we are given
the first seven? Second: can we separate human-generated strings from strings
generated by a true Bernoulli(0.5) process in a mixed sample?
We adapt Rabin (2002) and Rabin and Vayanos (2010) for these prediction
problems, and find that these models achieve (mean-squared) prediction errors
of approximately 0.249. These prediction errors are improvements upon a naive
baseline of 0.25 (corresponding to guessing at random). The problem of interest
regards the interpretation of the improvement of approximately 0.001 on the naive
baseline. How significant is this reduction in prediction error, and how much could
we hope to improve upon it? To answer these questions, we need a benchmark for
achievable prediction error.
There are several ways to construct such a benchmark using machine learning
prediction techniques. For our first set of results, we draw on an appealing
property of our domain: despite its conceptual richness, it has a compact enough
representation that we can construct an essentially “perfect” benchmark based
on table lookup, an algorithm that uses the empirical distribution of the full set of
combinatorially distinct strings in a training set to predict new strings. Using this
approach, we achieve prediction errors of approximately 0.243. If we take this to
be our benchmark, then existing behavioral models produce roughly 12% of the
achievable improvement in prediction error for this problem.
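Stated explicitly, the completeness ratio implicit in this comparison is the model's improvement over the naive baseline, divided by the benchmark's improvement. With the rounded numbers above (0.25 for guessing, roughly 0.249 for the behavioral models, and 0.243 for table lookup),
$$\text{completeness} \;=\; \frac{L_{\text{naive}} - L_{\text{model}}}{L_{\text{naive}} - L_{\text{lookup}}} \;\approx\; \frac{0.250 - 0.249}{0.250 - 0.243} \;\approx\; 0.14,$$
on the order of the 10-15% figure reported for these models.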
In the remainder of the paper, we outline and respond to two potential critiques
of this method for measuring completeness. The first is feasibility. Table lookup
can be implemented in the given problem because of the size of the domain space,
but is not a generally viable strategy. How sensitive is the estimated measure of
completeness to the choice of machine learning algorithm? Towards this concern,
we report the prediction error achieved by two alternative (standard) machine
learning approaches: LASSO regression and a decision tree algorithm. The best
of these algorithms achieves a prediction error of approximately 0.243, suggesting
that the benchmark constructed using table lookup may be approximated by
other machine learning algorithms that scale to problems with more complex
representations.
A second direction of concern regards whether the estimated ratio is special to
the problem of prediction of eight-length strings of coin flips. This may be the case
if, for example, table lookup succeeds by capturing specific features of generation
of eight-length H/T strings that do not generalize to related problems. We thus
examine the stability of estimated completeness when we change the prediction
task to predicting data collected in related but non-identical contexts. Specifically,
we learn a model for human generation of randomness using the original data of
eight-length coin flips, and then use this model to predict strings generated in a
nearby domain.
We consider two variations in the domain. In Section 3.1, we relabel the possible
outcomes: instead of asking subjects to generate binary strings generated as if
repeatedly flipping a fair coin labelled “Heads" and “Tails", we impose new frames
in which the coin is either labelled “@" on one side and “!" on the other, or “r"
on one side and “2" on the other. In Section 3.2, we change the index of the flip
to be predicted: instead of asking subjects to generate eight repeated coin flips
and predicting the eighth, we ask subjects to generate fifteen coin flips and try to
predict flips 9-15. We find that in these modified prediction problems, the existing
models produce between 7-30% of the improvement in prediction error obtained
using table lookup, providing evidence that the benchmark and ratio discovered
previously are indeed stable across local problem domains.
Taken together, these results suggest that: (1) there is a significant amount of structure in the problem of predicting human generation of randomness that existing models have yet to capture, and (2) machine learning may provide a generally viable approach to testing theory completeness.
2.2 Primary setting: human generation of coin flips
2.2.1 Description of data
We asked 334 subjects on Mechanical Turk to generate 50 binary strings of length
eight, each, as if these strings were the realizations of 50 experiments in which a
fair coin was flipped 8 times. The task was described to subjects using the text
below:
We are researchers interested in how well humans can produce randomness. A coin flip, as you know, is about as random as it gets. Your job is to mimic a coin. We will ask you to generate 8 flips of a coin. You are to simply give us a sequence of Heads (H) and Tails (T) just like what we would get if we flipped a coin.

Important: We are interested in how people do at this task. So it is important to us that you not actually flip a coin or use some other randomizing device.
Entry of the coin flips was implemented through eight drop-down menus, each of
which had the options “H" and “T". Subject effort was incentivized through the
following text:
To encourage effort in this task, we have developed an algorithm (based on previous Mechanical Turkers) that detects human-generated coin flips from computer-generated coin flips. You are approved for payment only if our computer is not able to identify your flips as human-generated with high confidence.
Additionally, to discourage use of an external randomizing device, we required
subjects to complete each string in 30 seconds or less. The complete set of directions
can be found in Appendix A.
The “algorithm” that we use for detection of lazy subjects identified the 26
strings whose empirical frequency exceeded the 90th percentile across strings
(0.0059). These strings were categorized as over-generated, and subjects who produced 20 or more over-generated strings were removed from the data. In total, this criterion removed 53 subjects, or 2,900 strings. The remaining dataset consists of a total of 281 unique subjects, or 14,050 unique strings. Throughout, we identify Heads with ‘1’ and Tails with ‘0,’ so that each string is an element of $\{1, 0\}^8$.
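A minimal version of this screen is easy to express in code. The sketch below is ours for illustration only (the names and data layout are not from replication materials): it flags strings whose empirical frequency in the pooled data exceeds the 90th percentile, and drops subjects with 20 or more flagged strings.

```python
from collections import Counter
import numpy as np

def filter_lazy_subjects(strings_by_subject, pct=90, max_overgenerated=20):
    """strings_by_subject: dict mapping subject id -> list of 'HTHT...' strings.
    Returns the subjects that pass the over-generation screen."""
    pooled = [s for strs in strings_by_subject.values() for s in strs]
    freq = Counter(pooled)
    total = len(pooled)
    cutoff = np.percentile([n / total for n in freq.values()], pct)
    overgenerated = {s for s, n in freq.items() if n / total > cutoff}
    return {
        subj: strs
        for subj, strs in strings_by_subject.items()
        if sum(s in overgenerated for s in strs) < max_overgenerated
    }
```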
Figure 2.1: (a) Top row. Distribution of the number of Heads in the realized string. Left: comparison of MTurk data with theoretical Bernoulli predictions. Right: comparison of Nickerson & Butler (2009) data with theoretical Bernoulli predictions. (b) Bottom row. Distribution of the proportion of runs which are of length $m$. Left: comparison of MTurk data with theoretical Bernoulli predictions. Right: comparison of Nickerson & Butler (2009) data with theoretical Bernoulli predictions.
We find that the observed distribution over strings is unlikely to have been
generated by a true Bernoulli(0.5) process: the hypothesis that the true distribution
over $\{1, 0\}^8$ is uniform is rejected under a $\chi^2$ test with $p \approx 0$. Moreover, the nature
of mis-generation is qualitatively consistent with comparative references (we use
Nickerson and Butler (2009) and Rapaport and Budescu (1997)). For example,
there is an over-tendency towards alternation (52.16% of flips differ from
the previous flip, as compared to an expected 50% in a Bernoulli(0.5) process), an
under-tendency to generate strings with “extreme” ratios of Heads to Tails (see the
top row of Figure 2.1), and an under-tendency to generate strings with long runs
(see the bottom row of Figure 2.1).
Additionally, subjects display strong context-dependency: the probability of
reversal depends on several previous flips. Table 2.1 compares this dependence
in our data with statistics reported in Rabin and Vayanos (2010) (using data from
Rapaport and Budescu (1997)), listing the respective probabilities that various
three-flip patterns are followed by ‘1.’ Except for a much softer contrast in our data
between the probability with which ‘000’ and ‘111’ are followed by ‘1’, we find that
these conditional probabilities are quite similar.

Pattern    Our data    Rapaport and Budescu (1997)    Bernoulli
0 1 0      0.5995      0.588                          0.5
1 0 0      0.5406      0.62                           0.5
0 0 1      0.5189      0.513                          0.5
0 0 0      0.5185      0.70                           0.5
1 1 1      0.4811      0.30                           0.5
0 1 1      0.4595      0.38                           0.5
1 1 0      0.4528      0.487                          0.5
1 0 1      0.4415      0.412                          0.5

Table 2.1: The empirical probability of Heads, conditional on three fixed previous flips: (1) the actual proportion of generated Heads in our data, (2) the assessed probability of Heads on the next flip from Rapaport & Budescu (1997), as presented in Rabin and Vayanos (2010), (3) probabilities consistent with a Bernoulli(0.5) process.
2.2.2 Theories of misperception
Motivated by empirical findings such as those described above, several frameworks
have been proposed for modeling human misperception of randomness. We
consider in particular the two approaches proposed in Rabin (2002) and Rabin and
Vayanos (2010).
Rabin (2002) models subjects who observe i.i.d. signals, but mistakenly believe
them to be negatively autocorrelated. Specifically, subjects observe a sequence of
i.i.d. draws from a Bernoulli($\theta$) distribution, where $\theta \in [0, 1]$ is an unknown rate
drawn from distribution $\pi$. Although subjects know the correct distribution $\pi$,
they have a mistaken belief about the way in which the realized rate $\theta$ determines
the signal process. Subjects believe that the observed signals are drawn without
replacement from an urn containing $\theta N$ ‘1’ signals and $(1 - \theta)N$ ‘0’ signals, so
that a signal of ‘1’ is less likely following observation of ‘1’, and vice versa. For
convenience, the author imposes an additional, stylized assumption in which
subjects believe that the urn is “refreshed” every other round, meaning that the
composition is returned to $\theta N$ ‘1’ signals and $(1 - \theta)N$ ‘0’ signals.

This model is primarily intended as a model of mistaken inference, and not
directly as a model of generation of random sequences, so a few adaptations are
needed to carry it into our setting. We alter the model in the following ways.
First, since subjects are told the bias of the coin (fair), we fix the distribution
$\pi$ over rates so subjects know that $\theta = 0.5$ with certainty. Second, we relax the
assumption that the urn is refreshed deterministically every other round, adding
a second parameter $p \in [0, 1]$, which determines the probability that the urn is
refreshed. In this revised model, subjects generate random sequences by drawing
without replacement from an urn that is initially composed of $0.5N$ ‘1’ balls and
$0.5N$ ‘0’ balls, and is subsequently refreshed with probability $p$ before every draw.
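As a concrete reading of this adapted model, the following sketch (our own illustrative implementation, with parameter names of our choosing) generates one eight-flip string by drawing without replacement from an urn of $N$ balls that is refreshed with probability $p$ before each draw:

```python
import random

def rabin_string(length=8, N=10, p=0.3, rng=None):
    """One pseudo-random H/T string from the adapted Rabin (2002) urn model:
    theta = 0.5, and the urn refreshes with probability p before each draw."""
    rng = rng or random.Random(0)
    ones = zeros = N // 2                 # urn starts at composition (N/2, N/2)
    flips = []
    for _ in range(length):
        if rng.random() < p:              # refresh event
            ones = zeros = N // 2
        draw = 1 if rng.random() < ones / (ones + zeros) else 0
        if draw == 1:
            ones -= 1                     # sampling without replacement
        else:
            zeros -= 1
        flips.append(draw)
    return "".join("H" if f else "T" for f in flips)
```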
The model proposed in Rabin and Vayanos (2010) is similar in spirit, but richer.
We use the following version of their model, which is closest to our setting. Each
subject generates the first flip, $s_1$, according to a true Bernoulli(0.5) distribution.
Then, each subsequent flip, $s_k$, is determined according to
$$s_k \sim \mathrm{Ber}\!\left( 0.5 - \alpha \sum_{j=0}^{k-2} \delta^{\,j}\, \big( 2 s_{k-1-j} - 1 \big) \right)$$
where the constant $\delta \in [0, 1]$ captures a (decaying) influence of past flips, and
the constant $\alpha \geq 0$ measures the strength of negative autocorrelation. Notice that
$2 s_k - 1 = 1$ if $s_k = 1$ and $2 s_k - 1 = -1$ if $s_k = 0$, so that past instances of ‘1’
reduce the probability that the $k$-th flip is ‘1’, and past instances of ‘0’ increase this
probability.
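A direct simulation of this process, under the reconstruction of the indices given above, is below (names are ours; the probability is clipped to $[0, 1]$ as a guard, which the theoretical model does not require for small $\alpha$):

```python
import numpy as np

def rv_string(length=8, alpha=0.2, delta=0.7, rng=None):
    """One string from the Rabin-Vayanos (2010)-style process: the chance
    of '1' falls with a decay-weighted sum of past flips."""
    rng = rng or np.random.default_rng(0)
    flips = []
    for k in range(length):
        drift = sum(delta**j * (2 * flips[k - 1 - j] - 1) for j in range(k))
        prob_one = float(np.clip(0.5 - alpha * drift, 0.0, 1.0))
        flips.append(int(rng.random() < prob_one))
    return "".join("H" if f else "T" for f in flips)
```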
2.2.3 Prediction tasks
We test these theories by looking at how well they predict features of our data. We
focus on two tests in particular. In the first test, which we refer to as continuation,
we try to guess a subject's eighth flip from his first seven flips. A prediction rule
for this problem is any map
$$f : \{0, 1\}^7 \to [0, 1]$$
that takes 7-length strings into the probability that the eighth flip is ‘1’. The error
in predicting a dataset $\{s^i\}_{i=1}^n$ of $n$ strings is measured using mean-squared error:
$$L(f) = \frac{1}{n} \sum_{i=1}^{n} \left( s^i_8 - f(s^i_{1:7}) \right)^2.$$
To gain some intuition on the possible values of L( f ), let us briefly note the
following.
Fact 1 Suppose that strings in {si}ni=1 are generated i.i.d. from a Bernoulli(0.5) distribu-
tion. Then, E[L( f )] � 0.25 for every possible prediction rule f .
That is, if subjects are truly generating strings according to a Bernoulli(0.5) process,
63
then we cannot improve on an expected mean-squared error of 0.25. If, instead, sub-
jects are generating strings according to either of the behavioral models described
above, then we can do better by leveraging the first seven flips. Optimal prediction
rules (in the sense of minimizing expected mean-squared loss) are defined for each
of these models in Appendix B, and denoted fR and fRV respectively.
In the second test, which we refer to as classification, we are presented with
a dataset of strings—half generated by human subjects, and half generated by a
Bernoulli(0.5) process—and seek to separate the human-generated strings from the
computer-generated strings. A prediction rule in this problem is any map
c : {0, 1}8 ! [0, 1]
from eight-length strings into a probability that the string was generated by a
human subject. The error in predicting a dataset {si}ni=1 of n strings is measured
using mean-squared error
L(
c)
=
n
Âi=1
⇣
ci � c⇣
si⌘⌘2
,
where ci= 1 if the true source of generation for string si was a human subject, and
ci= 0 otherwise.2 In an abuse of notation, we use L to refer to the loss function
in both problems, trusting that no confusion will arise. As above, we have the
following.
Fact 2 Suppose that strings in {si}ni=1 are generated i.i.d. from a Bernoulli(0.5) distribu-
tion. Then, E[L(c)] � 0.25 for every possible prediction rule c.
2A brief comment on the relationship between these two prediction tasks. One may wonderwhether success on one implies success on another. This need not be so. Observe that success onthe continuation task is achieved by correctly assessing the likelihood of s1:71 versus s1:70. If thisratio is close to the true ratio for every s1:7, the unconditional probability of either string s1:71 or s1:70occurring can be very off.
64
Thus, if strings are truly generated according to a Bernoulli(0.5) process, then we
cannot improve on an expected prediction error of 0.25. If, instead, the strings
are generated according to either behavioral model above, then we can improve
upon this error. We define the optimal prediction rules (in the sense of minimizing
expected mean-squared loss) for these models in Appendix B, and refer to them
(respectively) as cR and cRV .
Following, we test prediction rules fR, fRV in the continuation task on the
Mechanical Turk data, and prediction rules cR, cRV in the classification test, given a
merged dataset consisting of the Mechanical Turk data and an equal number of
strings generated according to a Bernoulli(0.5) process. The reported prediction
error is obtained using ten-fold cross validation: we (randomly) partition the data
into 10 equally-sized subsets, estimate the free parameters of the model on nine
subsets (the training set), and predict the strings in the tenth (test set). The reported
error is an average across choices of test set.
Continuation ClassificationGuessing 50-50 0.25 0.25
Rabin (2002) 0.2495 0.2493(0.0001) (0.0001)
Rabin and Vayanos (2010) 0.2491 0.2495(0.0001) (0.0001)
Table 2.2: Prediction errors achieved using Rabin (2002) and Rabin (2010) are improvements on theprediction error achieved by guessing at random. How do we assess the size of this improvement?
In Table 2.2, we compare the obtained prediction errors with a naive baseline in
which we predict ‘1’ with probability 0.5 in the continuation task, and classify each
string as human generated with probability 0.5 in the classification task. The core
motivation for this paper is clearly seen here: Both behavioral models are more
predictive than the naive baseline, but the improvement on the naive baseline of
65
(up to) 0.0009 is extremely difficult to interpret. How much have these models
improved upon the naive prediction rule, and how much could we further hope to
improve upon it? To answer these questions, we need a benchmark for obtainable
prediction error that is much more suitable than 0.
2.2.4 Establishing a benchmark
As we discussed in the introduction, our proposed benchmark is the prediction
error achieved using a technique we refer to as table lookup.
Definition 7 (Table Lookup) Let g be the empirical distribution over strings in the
training data. The table lookup continuation rule is
fT(s1:7) =g(s1:71)g(s1:7)
for all s1:7 2 {1, 0}7. (2.1)
where ‘s1:71‘ is the concatenation of strings s1:7 and ‘1.’ The table lookup classification rule
is cT(s) = g(s).
In the continuation task, the table lookup prediction rule assigns to every string
s1:7 2 {1, 0}7 the empirical frequency with which the string is followed by ‘1’ in
the training data. In the classification task, the table lookup prediction rule assigns
to every string its empirical frequency in the training data. Under the assumption
that strings are i.i.d. across subjects, the prediction error achieved using this rule
approaches the “lowest possible" prediction error with sufficient training data. This
is equivalent to the irreducible error in the problem, or the Bayes error rate.
Table 2.3 compares the prediction error achieved using the behavioral models
above with the prediction error achieved using table lookup. As before, prediction
error is evaluated using 10-fold cross validation.
We find that table lookup achieves a prediction error of 0.2425 in the contin-
uation task and 0.2430 in the classification task. These errors are far from 0, so
66
this is a nontrivial modification from a benchmark of no error. A simple measure
of “completeness" of the existing theories is the ratio of improvement in predic-
tion error achieved by the best behavioral model (over the naive approach) to the
improvement in prediction error achieved by table lookup. For example, in the
continuation task,
ratio of improvement =0.25 � min(0.2495, 0.2491)
0.25 � 0.2425= 0.1233
and in the classification task,
ratio of improvement =0.25 � min(0.2494, 0.2495)
0.25 � 0.2430= 0.0857.
Taking table lookup as our benchmark for obtainable prediction error, these
results suggest that existing behavioral models produce between 9-12% of the
achievable improvement in prediction error.
Continuation ClassificationBernoulli 0.25 0.25
Rabin (2002) 0.2495 0.2494(0.0001) (0.0001)
Rabin and Vayanos (2010) 0.2491 0.2495(0.0001) (0.0001)
Table Lookup 0.2425 0.2430(0.0001) (0.0001)
Completeness using TL as a benchmark 0.1233 0.0857
Table 2.3: Comparison of prediction error achieved using behavioral models with prediction error achievedusing table lookup. The behavioral models explain between 9-12% of the explainable variation in the data.
2.2.5 Other possible benchmarks
Table lookup is feasible in this problem because of the size of the domain space (27
unique strings in the continuation task, and 28 unique strings in the classification
67
task). We can in fact search the space of predictive models to optimality, an
approach that will not be feasible in every problem. Can we use other (practically
implementable) machine learning algorithms as surrogate benchmarks in these
other cases?
In this section, we consider use of two alternative benchmarks. First, we use
LASSO regression to select a prediction rule that depends only a small set of
covariates. The set of features we consider are:
• the empirical frequency of alternations (the probability that flip sk is followed
by its opposite, averaged across all k)
• indicators for the existence of runs of length 2, 3, . . . , 8 in the string
• the total number of ‘1’s in the string
• the index for the first occurrence of ‘1’ in the string
and their interactions up to degree 3. This yields a total of 176 features (including
the intercept). Let x(s) 2 R176 denote the feature vector corresponding to string s,
and define prediction rules
fb(s1:7) = x(s1:7)T b for all s1:7 2 {1, 0}7
cb(s) = x(s)T b for all s 2 {1, 0}8
Then, in the continuation problem, the LASSO coefficient vector b solves
argminb2R176
N
Âi=1
L( fb) + lkbk1,
where kxk1 denotes the l1 norm of the vector b (the sum of the absolute values
of its components). In the classification problem, the LASSO coefficient vector b
68
solves
argminb2R176
N
Âi=1
L(cb) + lkbk1.
Second, we implement a decision tree algorithm using the same feature set.3 We
consider the class of decision trees in which each node considers a single feature,
and each branch corresponds to a different value (or set of values) for the feature.
New inputs are classified by proceeding down the decision tree to a terminal node,
which determines the class.
Table 2.4 below summarizes the prediction errors achieved using LASSO and
decision trees in both problems, and reports also the measure of completeness,
using these alternative prediction errors as benchmarks.
Continuation ClassificationTable lookup 0.2425 0.2430
Completeness using TL as a benchmark 0.1233 0.0857
LASSO 0.2460 0.2484Completeness using LASSO as a benchmark 0.225 0.375
Decision tree 0.2419 0.2434Completeness using decision trees as a benchmark 0.1111 0.9091
Table 2.4: Comparison of prediction error achieved using behavioral models with prediction error achievedusing table lookup.
The LASSO prediction rules yield a (tenfold cross-validated) prediction error
of 0.2460 in the continuation problem and 0.2484 in the classification problem.
Decision trees yield a (tenfold cross-validated) prediction error of 0.2419 in the
continuation problem and 0.2434 in the classification problem.4 Additionally, we
find that the estimated measure of completeness varies from 11% to 23% in the
3Decision trees are a recursive partitioning of the domain space. Formally, a decision tree is arooted tree, in which each node splits the domain space into two or more subspaces according to adiscrete function of the input values.
4The slight improvement of decision trees upon table lookup should be interpreted as noise dueto the finite size of our training data.
69
continuation task, and from 9% to 37% in the classification task, depending on the
choice of algorithm. These results suggest that the benchmark constructed using
table lookup may be approximated by other machine learning algorithms that scale
to problems with more complex representations.
2.3 Transfer across Domains
The previous sections established a benchmark error of roughly 0.24, and discov-
ered that existing behavioral models achieve approximately 10% of the possible
improvement in prediction error (above a naive baseline). How special are these
results to the particular setting that we considered? Should we interpret this
benchmark and measure of completeness as pertaining only to human generation
of eight-length H/T strings, or do they express a more general truth about the
predictability of human generation of random sequences, and the extent to which
our existing models have attained this?
One reason to be concerned is the possibility that behavioral models capture
fundamental aspects of human generation of random sequences (not special to
production of fair coin flips), while table lookup relies on specific features of
generation of eight-length H/T strings that do not generalize to related problems.
For example, 57% of strings begin with ‘1’. The prediction rules derived from Rabin
(2002) and Rabin and Vayanos (2010) don’t leverage this feature for prediction, but
table lookup does. If the predictive accuracy achieved by table lookup is due in
large part to use of features like this, then we should not expect its performance to
generalize.
To address this question, we consider two robustness checks of “transfer pre-
diction" across different framings of the generation problem. Our approach is as
follows. We first estimate the free parameters of all the models (table lookup, and
70
the two behavioral models) on the original dataset of eight-length H/T strings.
Then, we use the estimated models to predict new strings, which are not only out-
of-sample (not used in the estimation of the free parameters), but also produced in a
modified problem domain (the generation problem is framed differently). By looking
at how well the estimated models predict in this new environment, we can assess
whether the features used in table lookup are stable features of misperception
across these various domains, or whether they are specific to the original problem.
We consider two variations in the domain. In Section 3.1, we relabel the possible
outcomes: instead of asking subjects to generate binary strings generated as if
repeatedly flipping a fair coin labelled “Heads" and “Tails", we impose new frames
in which the coin is either labelled “@" on one side and “!" on the other, or “r"
on one side and “2" on the other. In Section 3.2, we change the index of the flip
to be predicted: instead of asking subjects to generate eight repeated coin flips
and predicting the eighth, we ask subjects to generate fifteen coin flips and try to
predict flips 9-15. We find that in these modified prediction problems, the existing
models produce between 7-30% of the improvement in prediction error obtained
using table lookup, providing evidence that the benchmark and ratio discovered
previously are indeed stable across local problem domains.
2.3.1 Prediction of New Alphabets
In the first transfer prediction task, we attempt to predict strings under a relabelling
of the outcome space from {Heads, Tails} to {r, 2}, and to {@, !}. Specifically, we
ask 124 subjects on Mechanical Turk to generate 50 binary strings of length eight
“as if these strings were the realizations of 50 experiments in which a fair coin
labeled ‘r’ on one side and ‘2’ on another was flipped 8 times." We ask another 114
subjects to generate 50 binary strings of length eight “as if these strings were the
71
realizations of 50 experiments in which a fair coin labeled ‘@’ on one side and ‘!’
on another was flipped 8 times".
Following the procedure outlined in Section 2.2, we determine which strings
have an empirical frequency exceeding the 90th percentile, and remove all subjects
who produced 20 or more such strings. We also identify strings with elements
of {1, 0}, mapping ‘r’ and ‘@’ into ‘1’, and ‘2’ and ‘!’ into ‘0.’ We refer to these
datasets of strings respectively as Dr2 and D@!, and the original data as DHT.
The prediction problems we examine are the following. First: Suppose we know
the first seven flips that a subject generated in Dr2 (D@!). How well can we predict
his eighth flip, using only the strings in DHT to train our prediction model? This is
the transfer analogue of the continuation task described in Section 2.4.
Second: Suppose we have a dataset combining the strings in Dr2 (D@!) with an
equal number of strings generated by a Bernoulli(0.5) process. How well can we
separate the human-generated strings from the computer-generated strings, using
only the strings in DHT to train our prediction model? This is the transfer analogue
of the classification task described in Section 2.4.
To answer these questions, we estimate the free parameters in Rabin (2002),
Rabin and Vayanos (2010), and table lookup using DHT, and then use the estimated
models to predict strings in Dr2 and D@!. The resulting prediction errors (reported
below as ten-fold cross validated errors) are listed in Table 2.5, as well as a measure
of completeness, using the table lookup prediction error as a benchmark.
The most important components of the table above are the following. First, we
find that the benchmarks of 0.2425 and 0.2430 discovered previously are incredibly
robust across the different framings: table lookup achieves prediction errors ranging
from 0.2431 to 0.2456. The easure of completeness ranges between 7% and 18%,
and is comparable to the range of nine to 12% found previously.
72
Continuation Classification
{r, 2} {@, !} {r, 2} {@, !}Guessing 50-50 0.25 0.25 0.25 0.25
Rabin (2002) 0.2493 0.2499 0.2499 0.2493(0.0001) (0.0001) (0.0001) (0.0001)
Rabin and Vayanos (2010) 0.2491 0.2497 0.2491 0.2501(0.0001) (0.0001) (0.0001) (0.0001)
Table Lookup 0.2451 0.2456 0.2431 0.2434(0.0001) (0.0001) (0.0001) (0.0001)
Completeness using TL as a benchmark 0.1836 0.0682 0.1304 0.1061
Table 2.5: We train table lookup and our two behavioral models on the original 8-length {H, T} data, andthen use the estimated data to predict 8-length {r, 2} and {@, !} data. Reported prediction errors are tenfoldcross-validated mean squared errors.
2.3.2 Prediction of Subsequent Flips
In the second transfer prediction task, we use the original eight-length strings
to predict strings of length fifteen. The data to be predicted was produced by
asking 120 subjects on Mechanical Turk to generate 25 binary strings of length
fifteen “as if these strings were the realizations of 25 experiments in which a fair
coin was flipped 15 times". From this data, we construct seven “ghost" datasets of
eight-length strings, each including only flips k through k + 7, where k 2 {2, . . . , 8}.
Following the procedure outlined in Section 2.2, we identify the strings whose
empirical frequency exceeded the 90th percentile, and remove all subjects who
produced 20 or more such strings. We again identify strings with elements of
{1, 0}, mapping ‘H’ into ‘1’, and ‘T’ into ‘0.’ The datasets are labeled Dk:k+7, with
k = 2, . . . , 8.
The prediction problems we examine are the following. First: Suppose we know
the first seven flips that a subject generated in Dk:k+7. How well can we predict the
final flip, using only the strings in DHT to train our prediction model? This is the
transfer analogue of the continuation task described in Section 2.4.
73
Second: Suppose we have a dataset combining the strings in Dk:k+7 with an
equal number of strings generated by a Bernoulli(0.5) process. How well can we
separate these strings into those that are human-generated and computer-generated,
using only the strings in DHT to train our prediction model? This is the transfer
analogue of the classification task described in Section 2.4.
To answer these questions, we estimate the free parameters in Rabin (2002),
Rabin and Vayanos (2010), and table lookup using DHT, and then use the estimated
models to predict strings in Dk:k+7, k = 2, . . . , 8. Table 2.6 reports prediction errors
obtained using table lookup, and the two behavioral models. We show two results:
first, the prediction error obtained in dataset D8:15 alone; second, the prediction
error averaged across the seven datasets Dk:k+7, k = 2, . . . , 8.
Continuation Classification
Last flip Average Last flip AverageGuessing 50-50 0.25 0.25 0.25 0.25
Rabin (2002) 0.2500 0.2484 0.2474 0.2479(0.0001) (0.0001) (0.0001) (0.0001)
Rabin and Vayanos (2010) 0.2462 0.2466 0.2484 0.2485(0.0001) (0.0001) (0.0001) (0.0001)
Table Lookup 0.2369 0.2391 0.2409 0.2421(0.0001) (0.0001) (0.0001) (0.0001)
Completeness using TL as a benchmark 0.2900 0.3098 0.2857 0.2772
Table 2.6: We train table lookup and our two behavioral models on the original 8-length {H, T} data,and then use the estimated data to predict the data in {Dk:k+7}8
k=2. Reported prediction errors are tenfoldcross-validated mean squared errors.
We find that the table lookup prediction error ranges from 0.2369 to 0.2421,
which is comparable to the range from 0.2425 to 0.2430 discovered earlier, and that
the measure of completeness ranges from 27% to 31%, greater but comparable to
the previous range of nine to 12%.
74
2.4 Discussion
2.4.1 Guarantees on the benchmark
The problem analyzed in this paper can be reformulated as follows. Let X =
{Xk}k�1 be the “human-generated" {1, 0}-valued random process. If the condi-
tional distribution of X8 given the realizations of X1, . . . , X7 is non-degenerate, then
the 8th flip cannot be predicted perfectly from the first seven (and likewise, realiza-
tions of X cannot be perfectly separated from realizations of a true Bernoulli(0.5)
sequence). We want to compare the prediction error achieved using existing models
not with 0, but with the irreducible error
E(
X8 � f ⇤(X1, . . . , X7))2 (2.2)
where the expectation is with respect to the distribution over {1, 0}8 induced by
{X1, . . . , X8}, and f ⇤ is defined by
f ⇤(x1, . . . , x7) = Pr(X8 = 1|X1 = x1, . . . , X7 = x7)
This is the error that would be achieved by using the true model X to predict
realizations of X8, known in the literature as the Bayes risk. Notice that no function
f : {0, 1}7 ! [0, 1] can improve upon f ⇤ in an expected mean-squared error sense.
The prediction error obtained using table lookup is a consistent estimator of
the Bayes risk: as the quantity of training data approach infinity, the prediction
error achieved using fTL approximates (2.2) to arbitrary precision. However, this
approach, which relies on nonparametric estimation of each Pr(1 | x1, . . . , x7) for
every (x1, . . . , x7) 2 {1, 0}7, is not feasible in problems where the domain space
is substantially larger. In these more general settings, approaches such as those
considered in Section 2.5 (LASSO regression and decision trees) are more viable
75
alternatives. Choice of which algorithm is used to generate the benchmark should
rely on domain knowledge regarding the assumptions the process is likely to
satisfy. We refer the reader to a large literature on estimation of Bayes risk for
theoretical guarantees of different approaches.
2.4.2 Covariates
Throughout this paper, we have considered a fixed set of explanatory covariates,
namely the initial seven flips in the first prediction task, and the string itself in the
second. The notion of completeness we have proposed is more precisely stated as
a measure of completeness for this given set of covariates.
The proposed measure of completeness, therefore, is not instructive towards
which additional covariates (beyond initial flips) might be added to improve
prediction, or how much prediction accuracy can be improved by adding these
covariates. What it does allow us to do is separate two potential reasons for
imperfect prediction: low predictive accuracy because we haven’t yet identified
the most predictive covariates, and low predictive accuracy because our model is
using good covariates in a poor way. For example, suppose we find that benchmark
accuracy is low (say, 55%). This indicates that we may gain from investigating
additional possible covariates, rather than try to improve the current classifier
without changing the feature set. If instead, benchmark accuracy is high (say, 80%),
while our achieved accuracy is low (say, 55%), then there are large gains potentially
to be had in prediction, without the addition of any new features.
2.4.3 Transfer learning
The core of Section 3 is a question of a transfer learning: how much can we
improve prediction of strings generated in a given domain, using knowledge of
76
how strings are generated in a related domain? We focused on transfer learning
across human-generated Bernoulli(0.5) sequences with different outcome spaces
and string lengths, and found that prediction of strings generated for a given
outcome space and string length could be substantially improved using data on
strings generated for a different outcome space and a different string length. This
is reassuring: it suggests that transfer learning is possible across close problems.
But a fuller understanding of the generalizability of machine learning ap-
proaches will require investigation into the extent to which transfer learning is
possible. For example, let us consider a larger class of random processes, in which
the probability of ‘1’ varies, or the size of the outcome space varies. Can we
transfer learn across these more substantive changes in the domain? Concretely:
might we use human-generated Bernoulli(0.5) strings to predict human-generated
Bernoulli(0.6) strings, or human-generated Bernoulli(0.5) strings to predict human-
generated Normal(0,1) strings? We leave this large body of questions for future
work.
2.5 Relationship to Literature
The closest paper to ours is Peysakhovich and Naecker (2016), which independently
proposes the use of machine learning algorithms to provide a benchmark on
obtainable predictive accuracy. The authors focus in this paper on the domain
of choice under uncertainty, assessing the ability of classical models to predict
choices over (risky and ambiguous) lotteries. They use regularized regression as a
benchmark, and find that classical models explain a larger fraction of achievable
prediction error in the domain of risk than in the domain of ambiguity.
The question considered in this paper is also distantly related to classical work
concerning the learnability of a random process. For example, Jackson et al. (1999)
77
derive a Bayesian representation for a stochastic process, with the property that
component distributions are fine enough to be “sufficient for prediction," but coarse
enough so that the components are “learnable." These questions consider asymptotic
properties of the random process, whereas we focus on learning finite properties
of the random process (e.g. the eighth flip of the coin).
Finally, this paper is related to the extensive experimental (Bar-Hillel and
Wagenaar, 1991; Rapaport and Budescu, 1997; Rath, 1966; Edwards; Nickerson and
Butler, 2009; Wagenaar, 1972), empirical (Camerer, 1989; Chen et al.; Gillovich et al.,
1985; Croson and Sundali, 2005), and theoretical (Falk and Konald, 1997; Tversky
and Kahneman, 1971; Barberis et al., 1998; Rabin and Vayanos, 2010) literature on
human misperception of randomness.
2.6 Conclusion
Machine learning has produced techniques that have enabled substantial progress
in various academic disciplines. The question of how these techniques are best
leveraged for advancing research in the social sciences remains open. The purpose
of this paper is to propose one potential use of these techniques: towards assess-
ing theory “completeness." We illustrate this proposal on the simple problem of
predicting human generation of fair coin flips. We show that the prediction error
obtained using the algorithm “table lookup" is a useful benchmark for evaluation of
the success of existing behavioral models—and in particular, that existing models
explain up to 30% of the predictable variation in the problem. Moreover, this
benchmark is robust across related problems, suggesting that machine learning
may provide a generally viable approach to testing theory completeness.
78
Chapter 3
Interpretation of Inconsistent
Choice Data: How Many
Context-Dependent Preferences
are There?
3.1 Introduction
In the simplest model of choice, an individual’s preference is described as a linear
ordering over alternatives in a set X, and his choice from any subset A ✓ X is the
-maximal element in A. It is common to interpret choice data under this model, and
to infer a single ordering best fit to the data. In practice, however, choice datasets
may result from the maximization of several different preferences. For example:
1. Heterogeneity in preference across choice domains. Choice data frequently pools
observations drawn from a variety of choice domains, but agents may have
79
different preferences in different domains. For example, Einav et al. (2012)
study the commonality of financial risk preferences across six choice domains
— including 401(k) asset allocations, short-term disability insurance, and
insurance choices regarding health, drug, and dental expenditures — and find
that just over 30% of their sample makes decisions that can be simultaneously
rationalized over all six domains.
2. Multiple behavioral selves. There is extensive empirical evidence that preference
varies with external details of the choice environment, for example the
framing of the problem (Kahneman and Tversky, 2000), the presence of
default options (Beshears et al., 2008), and the addition of decoy options
(Huber et al., 1982). Outside of the laboratory, it is unusual for external details
to remain constant across every observed choice; choices may therefore reflect
different behavioral biases in different observations.
3. Multiple representative agents. Choice data is often pooled from a population,
across which there may be subpopulations or clusters of individuals with
different preferences. Ideally, agents in different clusters are identifiably
different, but in practice the observable characteristics of these agents may be
indistinguishable (see, for example, Crawford and Pendakur (2012)).
In each of these settings, the analyst may not know beforehand the number of
context-dependent preferences maximized in the data. Since context-dependence is
a persistent feature of preference, knowing the number of contexts is important for
two reasons: 1) Welfare implications—it tells us whether standard welfare analysis is
appropriate for interpretation of the observed choices, or whether there is genuine
preference variation that should be elicited to inform normative statements; 2)
Prediction—it tells us whether to interpret the observed inconsistency as noise, or
80
as systematic variation that can inform prediction of future choices.
The purpose of this paper is to provide a tool for determining the number
of context-dependent preferences maximized in the data. The challenge is that
inconsistencies may reveal genuine context-dependencies, but may also simply
reveal choice error (for example, due to inattention by the subject or measurement
error by the analyst). At extremes, we can rationalize the data using only multi-
plicity in rationales (in which case every observation in conflict is described with
a new ordering) or using only choice error (in which case every observation in
conflict is described as a mistake). How can the analyst recover the true number of
context-dependent preferences from choice data?
This paper proposes a simple approach using regularization1, a statistical
technique in which a penalty is imposed on the complexity of the learned model to
prevent overfitting. Regularization techniques have received tremendous interest
in recent decades in the applied mathematics and computer science communities
due to their ability to recover various sparse structures from noisy or otherwise
corrupted data. Applications range from recovery of signals (Donoho and Huo,
2001; Elad and Bruckstein, 2002) to repair of damaged images and videos (Ren
et al., 2012; Yang et al., 2013) to lyrics and music separation (Huang et al., 2013).2
The sparse structure in my problem is preference: while the agent may possess as
many as |X|! context-dependent-orderings over X, I assume that the number of
1For example, two common regularization techniques for learning regression function f (x) = xbextend the usual OLS approach, min Âi(yi � xib)
2, to:
• min Âi(yi � xib)2+ l||b||2 (ridge regularization), and
• min Âi(yi � xib)2+ l||b||1 (lasso regularization).
Ridge regression penalizes the complexity of the learned model through the l2 norm on the vector ofcoefficients, and lasso regression penalizes the l1 norm.
2These ideas and techniques have recently begun to be applied to economic problems. See forexample Belloni et al. (2011), Belloni and Chernozhukov (2011), Belloni et al. (2012), and Gabaix(2014).
81
orderings he maximizes is much smaller.
I adapt methods from this literature to suggest an “optimal" number of order-
ings to use in describing the data. A best multiple-ordering rationalization (BMOR) is
defined as a solution to the following linear programming program:
argminR2R E(D, R) + l|R|, (3.1)
where R consists of all sets of orderings over X, E(D, R) is the number of observa-
tions in dataset D that are inconsistent with maximization of any ordering in R, and
l 2 R+
is a constant. The program in (3.1) thus maximizes fit (by minimizing the
number of unexplained observations E(D, R)) subject to a penalty on the number
of orderings used (by minimizing |R|), and the constant l trades off between these
two goals.
Notice that the problem in (3.1) nests as special cases two well-known ap-
proaches in the literature. The choice of l = 0 returns the Houtman and Maks
(1985) solution: the ordering that explains the largest number of observations in
the data. The choice of l � 1 returns the Kalai et al. (2002) solution: the smallest
set of orderings that explains all of the data. The problem in (3.1) generalizes these
two approaches by considering intermediate values of l 2 (0, 1).
For what choices of l, and with what guarantee, does the solution to (3.1)
recover features of the agent’s preference? The main part of the paper shows that
if choice data is generated by any in a large class of models, there is an interval of
choices of l for which the approach in (3.1) exactly recovers the correct number of
orderings with probability exponentially close to 1. The class of data-generating
processes I consider is the following. Let F be a (finite) set of K contexts and
let R = { f } f2F be a set of context-dependent preferences.3 A choice problem is a
3In the special case in which the set of contexts is a partition of the power set on X, R describes a
82
pair (A, f ) consisting of a choice set A and a context f . In each choice problem,
the the agent selects the f -optimal alternative in A with probability at least 1 � p;
otherwise, he trembles and selects a different alternative. I do not impose any
parametric assumptions on either the pattern of error or the nature of the orderings.
Taking K = 1 and p = 0 returns the canonical single-ordering model, and taking
K � 1 and p = 0 returns the generalized choice functions proposed independently
in Salant and Rubinstein (2008) and Bernheim and Rangel (2009).
My main result (Theorem 1) shows that if the number of orderings K is suffi-
ciently small (relative to the number of possible orderings, |X|!), the probability p
of erring in each choice is sufficiently low, and the choice implications of the prefer-
ence orderings in R are sufficiently different (see Section 5.2), then the problem in
3.1 recovers the exact number of orderings K with probability exponentially close
to 1 (in quantity of data). Section 5.3 qualifies this result: while we can recover the
number of orderings, it is not in general possible to recover the orderings. I make
preliminary comments towards extension of the approach to recover further details
of preference.
Section 6 is the literature review, and Section 7 concludes.
3.2 Notation
Let X denote the set of choice alternatives, A denote a typical subset of X, and
denote the set of all subsets of X. I refer to A as a choice set, and as the set of all
choice sets. A choice observation (x, A) is a pair denoting selection of alternative x
from choice set A. A dataset is a collection of choice observations
D = {(x, A) | A 2 A},
set of menu-dependent preferences.
83
where A is a (multi)set of elements from .
A strict linear ordering � is a complete, antisymmetric, and transitive binary
relation on X. Let R be the set of all permutations of (1, 2, . . . , N), with typical
element . Identify every linear ordering � with the permutation or preference
ordering = (r1, r2, . . . , rN) 2 R satisfying ri < rj if and only if xi � xj. (For example,
x1 � x3 � x2 is identified with = (1, 3, 2).) Coordinate ri can be interpreted as
the ordinal rank of alternative xi according to the ordering �. The choice function
c :! X induced by takes every choice set A 2 to the -maximal element in A. Say
that dataset D is consistent if there exists an ordering such that (x, A) 2 D only if
x = c(A), and inconsistent otherwise. Two choice observations (x, A) and (x0, A0)
are said to be in violation of the Independence of Irrelevant Alternatives axiom (IIA) if
x, x0 2 A and also x, x0 2 A0, so that they cannot be rationalized by the same strict
ordering.
For a given set of orderings R 2 R, define
C(R) :=n
(x, A) | A ✓ and x 2 {c(
A), 2 R}o
to be the set of all choice observations consistent with maximization of some
ordering in R. I refer to these as the choice implications of the set R. For example, let
X = {x1, x2, x3} and define orderings 1 = (1, 2, 3) and 2 = (1, 3, 2). Then,
C(R) ={(x1, {x1, x2}), (x1, {x1, x3}), (x2, {x2, x3}),
(x3, {x2, x3}), (x1, {x1, x2, x3})}.
3.3 Example
I begin by illustrating ideas on a toy choice dataset, and subsequently describe
the general approach in Section 4. Consider a set of choice alternatives X =
84
{x1, x2, x3, x4} and the dataset D consisting of the following 21 observations:
• Choice of x1 from every subset containing x1.
• Choice of x2 from {x2, x3}, {x2, x4}, and {x2, x3, x4}.
• Choice of x3 from {x3, x4}.
• Choice of x4 from every subset containing x4.
• Choice of x3 from {x1, x3}, {x2, x3}, and {x1, x2, x3}.
• Choice of x2 from {x1, x2}.
• Choice of x3 from {x1, x3, x4}.
There are many possible rationalizations of this data. If the analyst uses a single
ordering to explain the data, he can explain up to 10 observations, for example
with
R1 = {x1 � x2 � x3 � x4}.
The minimal obtainable choice error using a single ordering is D1 = 21 � 10 = 11.
If the analyst allows for two orderings, he can explain 20 observations, for example
with
R2 = {x1 � x2 � x3 � x4, x4 � x3 � x2 � x1}.
The minimal obtainable choice error using two orderings is D2 = 21 � 20 =
1. Finally, if the analyst allows for three orderings, he can explain all of the
observations, for example with
R3 = {x1 � x2 � x3 � x4, x4 � x3 � x2 � x1, x3 � x1 � x2 � x4}.
The minimal obtainable choice error using three orderings is D3 = 0. These
observations are collected in the figure below, which graphs the minimal obtainable
85
choice error Dk for values k = 1, . . . , 5.
21
k
3 Number of Orderings k
Choi
ce E
rror
1
11
How should the analyst choose between these solutions? Notice that the
ordering in R1 is consistent with the proposal of Houtman and Maks (1985), which
finds the largest subset of choice observations that can be explained by a single
ordering4. The set of orderings R3 is consistent with the proposal of Kalai et al.
(2002), which finds the smallest number of orderings that can perfectly explain all
observations.
This paper proposes an intermediate solution. Define the set of best multiple-
ordering rationalizations to be the solution to
argminR2R E(D, R) + l|R|,
choosing l =
10.1|D| =
12.1 , as proposed in Corollary 3. It is easy to verify that all best
multiple-ordering rationalizations consist of two orderings5. This is because the
gain from introducing a second ordering is nearly half of the dataset (D2 �D1 = 10),
whereas the gain from permitting a third is only a single observation (D3 � D2 = 1).
4This solution is not unique. For example x4 � x3 � x2 � x1 also explains 10 observations.
5And moreover, that this outcome holds for any choice of l 2⇣
110 , 1
⌘
.
86
The proposed approach thus interprets the data as reflecting maximization of two
orderings, with a single choice observation in error.
3.4 Approach
Fix a dataset D. The implied choice error when using a set of orderings R to
rationalize D is defined
E(D, R) := |{(x, A) 2 D : x 6= c(A) for all 2 R}|,
i.e. the number of choice observations in D that cannot be explained as maxi-
mization of any preference ordering 2 R. If we restrict to sets of r orderings, the
minimal obtainable choice error is given by
Dr := minR✓R, |R|=r
E(D, R).
Say that D is r-rationalizable if Dr = 0.
Remark 12 Dataset D is 1-rationalizable if and only if it is consistent.
Remark 13 For every dataset D, there exists a constant L min{|D|, |X|} such that D
is r-rationalizable for every r � L.
To allow for a tradeoff between the “simplicity" of the model (as defined
through the number of orderings) and its fit to the data, I define the set of l-best
multiple-ordering rationalizations of D as follows.
Definition 8 For any l 2 R+
, R⇤ is a l-BMOR of D if
R⇤ 2 argminR✓R
|R|+ lE(D, R). (3.2)
We can interpret l as arbitrating between the two goals of minimizing the number
of orderings and minimizing the number of implied choice errors. Loosely speaking,
87
an ordering is included in R if and only if it explains at least 1l observations that
would otherwise be interpreted as choice error. Thus, as l ! 0, the analyst prefers
to adopt a unique ordering for the agent and interpret the remaining observations
as error, while for large choices of l, the analyst prefers to use as many orderings
as necessary to eliminate choice error. Below, I show that the Houtman and Maks
(1985) solution is selected if l < 1D1
(Claim 4), and the Kalai et al. (2002) solution is
selected if l > 1 (Claim 5).
Claim 4 If l < 1D1
, every solution R⇤ to Eq. (3.2) satisfies |R⇤| 1.
Proof 7 Suppose there exists some l-BMOR R⇤ satisfying |R⇤| = K > 1. Then by the
definition of a l-BMOR, necessarily K + lDK 1 + lD1. Since moreover 1 + lD1 < 2,
it follows that K < 2 � lDK 2. This contradicts the assumption that K > 1.
Claim 5 If l > 1, every solution R⇤ to Eq. (3.2) satisfies |R⇤| = L, where L is the smallest
constant such that the data is L-rationalizable.
Proof 8 Since D is L-rationalizable, clearly no solution R⇤ to Eq. (3.2) will have |R⇤| > L.
Suppose there exists a l-BMOR R⇤ with |R⇤| = K < L. Assign a unique ordering to
each of the DK implied choice errors from rationalizing D with R⇤. This perfectly explains
the data with K + DK < K + lDK orderings, contradicting the assumption that R⇤ is a
l-BMOR.
Geometrically, solutions to (3.2) are described as follows. Let f be the linear
interpolation of points {(k, Dk), k 2 N}, and let F = {(x, y) | y � f (x)} be the
epigraph of f . Then, for any choice l 2 R+
, the problem in (3.2) returns a k-
ordering solution if and only if the line with normal vector (�1,�l) supports F at
(k, Dk). For example, given the dataset provided in Section 3, the line with normal
vector (�1,�l) supports F at (2, D2) for any choice of l 2 � 110 , 1
�
.
88
F
f
2
21
0
3 Number of Orderings
Choi
ce E
rror
Figure 3.1: The problem in (3.2) returns a solution with 2 orderings if and only if the line with normalvector (�1,�l) supports F at (2, D2).
How should the analyst choose l, and under what conditions on the data-
generating process do we find the proposed approach to recover the “true" number
of orderings maximized by the agent?
3.5 Recovery Results
Consider the following class of choice rules. Let F be a set of K contexts, R = { f } f2F
be a set of context-dependent preferences, and p 2 [0, 1] be a probability of
error. Given choice problem (A, f ), suppose that the agent chooses the f -optimal
alternative in A with probability at least 1 � p; otherwise, he trembles and chooses
a different alternative in A. Assume that the analyst does not know
1. the number of contexts.
2. the locations or number of realized errors.
3. the distribution of error.
89
Can he recover the true number of orderings K using the proposed approach in
(3.2)?
In general, recovery of the number of context-dependent preferences will not
be possible. My main result demonstrates, however, that the proposed method will
exactly recover K with probability exponentially close to 1 if: (1) the number of
preferences is small relative to the quantity of data, (2) the probability of error is
small, and (3) the context-dependent preferences are “sufficiently different," in a
sense made precise in Section 5.2.
A natural next question is whether we can do better: in particular, can we
recover either the locations of mistakes, or the set of orderings themselves? In
Section 5.3, I show that even in the absence of choice error (p = 0), recovery
of multiple preferences from choice data alone is in general an ill-posed task.
Proposition 7 states that no sets of three or more context-dependent orderings, and
only special pairs of context-dependent orderings, can be recovered.
3.5.1 Class of choice models
In this section, I show that the class of choice models I consider is fairly general,
including as special cases several familiar models of choice. To describe these
relationships, recall that a random choice rule is a map P from choice problems
(A, f ) to distributions over the elements of A, with the property that the support
of P(A, f ) is a subset of A for every (A, f ). The class of choice models I consider is
equivalent to the class of random choice rules satisfying
P⇣
x(A, f ) | A, f
⌘
� 1 � p for all (A, f ).
where x(A, f ) denotes the f -optimal choice in A. The following special cases are of
interest.
90
Case 1: p = 0 and K = 1. This returns the classic theory of choice in which a
single preference ordering is defined over X and choice satisfies
P(x|A, f ) =
8
>
<
>
:
1 if x = c(
A)
0 otherwise.for all (A, f ) 2 P
where the dependence on f is trivial.
Case 2: p = 0 and K � 1. This returns the model proposed in Salant and
Rubinstein (2008) and Bernheim and Rangel (2009), in which an agent is described
by a set of context-dependent orderings {r f } f2F, and choice satisfies
P(x|A, f ) =
8
>
<
>
:
1 if x = x(A, f )
0 otherwise.for all (A, f ) 2 P .
In the special case in which F is a partition of , this is equivalent to the model
studied in Kalai et al. (2002): a map h :! F cues contexts from choice sets, and
choice from A is the h(A)
-optimal alternative in A.
Case 3. p > 0 and K = 1. If there exist distributions {qA}A2 such that
P(x|A, f ) = qA(Rx,A) for all (A, f ) 2 P ,
then this returns the class of random utility models with choice-set dependent
distributions (as considered, for example, in Fudenberg et al. (2015)). In the special
case in which qA = q for every choice set A, we have the classic random utility
model (Block and Marshak, 1960).
3.5.2 Can we recover the number of orderings?
I provide intuition for the subsequent results by discussing a few negative examples
in which recovery of K using the approach in (3.2) is either impossible or difficult.
In both examples, I fix p = 0 to simplify ideas.
91
Example 8 Define the sets of orderings
R = {(1, 2, 3), (1, 3, 2), (3, 2, 1)}, and
R0= {(1, 2, 3), (3, 2, 1)}.
Suppose the agent’s true set of context-dependent preferences is described by R. It is
easy to verify6 that C(R) = C(R0), implying that every choice observation consistent
with maximization of some ordering in R is also consistent with maximization of some
ordering in R0. Then, for every choice of l and any choice dataset D generated by (perfectly)
maximizing orderings in R, we have that
E(D, R0) + l|R0| = 0 + 2l < E(D, R00
) + l|R00|
for every R00 consisting of three or more orderings. Thus, no solution of (3.2) will return
the true number of orderings (3) in the agent’s preference.7
Example 9 Define the sets of orderings
R = {(1, . . . , 9, 10), (1 . . . , 10, 9)}, and
R0= {(1, . . . , 9, 10)}.
Suppose the agent’s true set of context-dependent preferences is described by R. Observe
that the single ordering in R0 explains almost every choice observation that can result from
maximization of orderings in R. The single exception is the observation (x10, {x9, x10}),which is consistent with maximization of the ordering (1, . . . , 10, 9), but not with maxi-
mization of the ordering (1, . . . , 9, 10). Consider any choice dataset D that is generated
6C(R) = C(R0) = {(x1, {x1, x2, x3}), (x1, {x1, x2, }), (x1, {x1, x3}), (x2, {x2, x3}) (x2, {x1, x2}),
(x3, {x2, x3}), (x3, {x1, x3}), (x3, {x1, x2, x3})}.
7An alternative perspective is to consider R and R0 not merely observationally equivalent, butin fact equivalent models. In this case, we can re-interpret the observation as follows: the “morecomplex" description R will never be selected over the “less complex" description R0.
92
by (perfectly) maximizing orderings in R, and does not include (x10, {x9, x10}). Then, for
any choice of l,
E(D, R0) + l|R0| = 0 + l < E(D, R00
) + l|R00|
for every R00 consisting of two or more orderings. So observation of (x10, {x9, x10}) is
necessary to return the true number of orderings (2) using (3.2).
These examples are suggestive of the following: in order to recover the number
of context-dependent orderings, it is necessary that there exist “sufficient differen-
tiation" in the choice implications of these orderings. Following, I define such a
notion of differentiation.
Definition 9 Say that choice problems
A = {(A1, f1), . . . , (Ak, fk)}
are in k-violation of IIA if
1. c f (A) 6= cf 0 (A0
) for every (A, f ), (A0, f 0) 2 A, and
2. c f (A) 2 Tki=1 Ak for every (A, f ) 2 A.
Condition (1) requires that every choice problem in A has a distinct optimal choice,
and condition (2) requires that each of these (distinct) k alternatives is available
in every set Ai for i = 1, . . . , k. Notice that every pair of choice problems from Aconstitutes a (standard) violation of IIA.
Definition 10 The differentiation parameter dR() of a (multi)set of choice problems
A is the largest d such that there exists a partition of A into subsets {A1, . . . ,Ad+1}satisfying
1. |Ai| = K, and
93
2. Ai is in K-violation of IIA.
for every i 2 {1, . . . , d}.
I illustrate this definition on an example.
Example 10 Consider a set of choice alternatives X = {x1, x2, x3, x4, x5}, and a set of
frames F = { fp, fq}. The agent has context-dependent preferences
R = {p,q } = {(5, 4, 3, 2, 1), (1, 2, 3, 4, 5)}.
Suppose the agent maximizes q when there are three or fewer alternatives in the choice set,
and maximizes p otherwise. Let A consist of every choice problem (A, f ) with A 2 and
f 2 F. Then dR(A) = 6, with every pair in
({x1, x2}, fq), ({x1, x2, x3, x4}, fp)
({x1, x3}, fq), ({x1, x2, x3, x5}, fp)
({x1, x4}, fq), ({x1, x2, x4, x5}, fp)
({x1, x5}, fq), ({x1, x3, x4, x5}, fp)
({x2, x3}, fq), ({x2, x3, x4, x5}, fp)
({x2, x4}, fq), ({x1, x2, x3, x4, x5}, fp)
constituting a 2-violation of IIA.
The subsequent recovery results (informally) say the following: if there is
sufficient differentiation between context-dependent orderings and sufficient (not
necessarily complete) sampling of choice problems, then we can recover the number
of context-dependent orderings using Equation (3.2) with probability very close to
1. Theorem 1 applies to general sets of choice problems. Corollary 2 considers a
particular data-generating process in which M choice problems are sampled uni-
formly at random from P . Throughout, I take M to be the number of observations
94
in the data, N to be the number of alternatives, p to be the probability of error,
and dR(A) to be the differentiation parameter of A given the agent’s preferences
R. When there is no chance of confusion, I express dR() simply as d().
Theorem 3 Let be any (multi)set of M choice problems. Suppose, for some constant
b > 0,
d := d() >2p + b
(1 � p)K M (3.3)
Then for any d 2⇣
0, d(1�p)K
M � 2p � b⌘
, there exists a constant c > 1 such that the
optimization problem in Eq. (3.2) with l =
1(p+d)M exactly recovers |R| = K with
probability at least 1 � O�
c�M�.
I provide a brief proof sketch here and defer the details to the appendix. Identify
every dataset with an undirected (hyper)graph8 in the following way: nodes
represent choice alternatives, and there is an edge between a set of observations if
and only if these observations cannot be rationalized using the same preference
ordering. The key observation in the proof is that a dataset is k-rationalizable if
and only if the corresponding graph is k-colorable9. This equivalence is shown by
taking each color class to represent consistency with a distinct ordering. Thus, the
problem in 3.2 can be seen as finding the smallest number of colors k such that the
greatest number of nodes are k-colorable.
Fix any (multi-)set of choice problems A and suppose for the moment that
there is no choice error. Since the data is generated by perfect maximization of K
orderings, the corresponding hypergraph admits a K-coloring. Moreover, notice
that every set of observations in a K-violation of IIA constitutes a complete K-
partite subgraph. Since by assumption, the data includes at least d such sets, the
8A hypergraph is a generalization of a graph in which edges may connect more than two vertices.
9A k-coloring of a graph is a partition of its vertex set V into k color classes such that no edge inE is monochromatic. A graph is k-colorable if it admits an k-coloring.
95
Consistent with maximization of r1Consistent with maximization of r2Consistent with maximization of r3
Figure 3.2: Studying rationalizability of a dataset is equivalent to studying colorability of a graph in whichnodes represent observations and edges represent inconsistencies.
corresponding hypergraph includes at least d complete K-partite subgraphs. So it
cannot be colored by fewer than K colors.
Now introduce choice error. Each node is “corrupted" with probability p,
following which its edges are re-arranged (the node is removed from some edges
to which it belongs, and new edges between this node and others are introduced). I
show that if choice error is introduced at a sufficiently low probability, then enough
complete K-partite graphs remain in the perturbed graph such that “most" nodes
can be partitioned into K (but not fewer) colors.
The following corollary presents recovery properties for a particular, convenient,
choice of l.
Corollary 3 Let be any (multi)set of M choice problems. Suppose p 0.05 and d(A) >
0.25(0.95)K M. The optimization problem in Eq. (3.2) with l =
10.1M exactly recovers |R| = K
with probability at least 1 � O�
e�0.005M�.
Since the number of non-overlapping sets of size K from a set of M elements
96
cannot exceed MK , Condition (3.3) in Theorem 1 implies a tradeoff between the
number of orderings that can be recovered and the probability of error that can
be tolerated. For example, if p = 0.05 (5% probability of error) the theorem does
not apply to sets with more than 6 orderings, and if p = 0.01 (1% chance of error),
the theorem does not apply to sets with more than 35 orderings. In Corollary 1,
the (stronger) requirement d() > 0.250.95K M is satisfied only by sets including three or
fewer orderings. These strict thresholds are not necessary conditions for recovery,
and can be relaxed in future work10.
Below, I illustrate the implications of this approach and choice of l using a
problem of preference elicitation considered in Crawford and Pendakur (2012).
Example 11 Crawford and Pendakur (2012) study preferences over six different types of
milk, using a dataset including 500 Danish households and their purchases. Unsurprisingly,
no single utility function can explain all 500 observations. To accommodate heterogeneity
in preference, Crawford and Pendakur (2012) suggest an application of Kalai et al. (2002),
in which the minimal number of utility functions that explain all of the data is found. They
find that in fact 4-5 utility functions are sufficient to explain all of the data. This is a perfect
multiple-preference fit to the data.
But they further comment that the fifth utility function explains only 8 out of 500
observations, and drop this utility function in many of their latter analyses. This highlights
a limitation of the approach proposed in Kalai et al. (2002): the approach ignores variation
in the strength of evidence for recovered preferences.
We can extend the approach in Kalai et al. (2002) by finding a “best" imperfect multiple-
preference fit. Under the choice of l =
10.1(500) =
150 proposed in Corollary 3, preferences
exist in a best multiple-ordering rationalization only if they uniquely explain at least 50
10One way to relax this restriction is to count the number of possibly overlapping sets of choiceproblems that constitute an K-violation of IIA.
97
observations. The fifth utility function does not satisfy this criterion, so the remaining 8
observations are interpreted as choice error.11
Corollary 4 considers a related context in which the set of choice problems is
not fixed by the analyst, but generated by uniform sampling over the set of possible
choice problems P = {(A, f ) : A 2, f 2 F}. A similar result obtains provided the
differentiation parameter d(P) is sufficiently high.
Corollary 4 Suppose consists of M choice problems sampled uniformly at random from
P , and
d := d() >2p + b
(1 � p)K
⇣
2NK⌘
(3.4)
for some constant b > 0. Then for any d 2⇣
0, d(1�p)K
M � 2p � b⌘
, there is a constant
c > 1 such that the optimization problem in Eq. (3.2) with l =
1(p+d)M exactly recovers
|R| = K with probability at least 1 � O�
c�M�.
The details of the proof are deferred to the appendix.
3.5.3 Can we recover more?
Section 4.1 provides conditions under which the problem in (3.2) recovers the
correct number of context-dependent orderings with high probability. Is it possible
to recover the context-dependent orderings themselves? First, I show that even in
the absence of choice error (p = 0), recovery of multiple preferences from choice
data is in general an ill-posed task. From Proposition 7, no set of three or more
context-dependent orderings can be uniquely recovered, and only special pairs of
context-dependent orderings can be recovered.
11It is important to note, however, that Crawford and Pendakur (2012) consider utility functions defined on the continuous space of price-quantity pairs, whereas my recovery results in Section 5 pertain to orderings defined on finite sets of alternatives.
Next, I discuss the possibility of identifying an equivalence class (in choice
implications) containing the true set of context-dependent orderings. I provide an
example to illustrate that even this weaker notion of identifiability is not met. I
suggest that a more nuanced notion of complexity than cardinality is needed for
recovery of sets of context-dependent orderings, and conclude with preliminary
comments toward extension.
In the following, I say that R is identified if there exists data D such that

{R} = argmin_{R′⊆ℛ} { |R′| : E(D, R′) = 0 }.    (3.5)

That is, there exists some set of choice observations (possibly including observation of multiple choices from the same choice set) such that R is the unique set of k ≤ |R| orderings that perfectly explains the data. We can think of the data as “revealing" R to the data analyst. Otherwise, say that R is not identified. The following observation provides an equivalent characterization.
Observation 4 R is identified if and only if there does not exist R′ ≠ R satisfying |R′| ≤ |R| and E(C(R), R′) = 0.

Thus, if there exists any data which identifies R, then the dataset C(R) will identify R.
Example 12 Fix X = {x1, x2, x3} and R = {(3, 2, 1), (3, 1, 2)}. The set of choice implications of R,

C(R) = {(x1, {x1, x2}), (x1, {x1, x3}), (x2, {x2, x3}), (x3, {x2, x3}), (x1, {x1, x2, x3})},

is a subset of the choice implications of each of the sets

R′ ∈ { {(3, 2, 1), (2, 1, 3)}, {(3, 2, 1), (1, 2, 3)}, {(2, 3, 1), (3, 1, 2)}, {(1, 3, 2), (3, 1, 2)} }.

So every dataset that can be perfectly rationalized by the orderings in R can be perfectly rationalized by the orderings in any such R′. Therefore, R is not identified.
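This containment can be checked mechanically. The sketch below is a hedged illustration with hypothetical helper names: orderings are encoded as rank vectors (higher rank preferred, as in the example), and the choice implications of each pair are enumerated by brute force.

```python
# Hedged verification of Example 12: C(R) is contained in C(R') for each of
# the four listed pairs R', so no dataset separates R from them.
from itertools import combinations

X = (1, 2, 3)  # indices of the alternatives x1, x2, x3

def best(r, A):
    """r-maximal element of A, where r[i-1] is the rank of x_i (higher = better)."""
    return max(A, key=lambda i: r[i - 1])

def implications(R):
    """All choice observations (x, A) consistent with some ordering in R."""
    sets = [A for n in (2, 3) for A in combinations(X, n)]
    return {(best(r, A), A) for r in R for A in sets}

R = [(3, 2, 1), (3, 1, 2)]
alternatives = [[(3, 2, 1), (2, 1, 3)], [(3, 2, 1), (1, 2, 3)],
                [(2, 3, 1), (3, 1, 2)], [(1, 3, 2), (3, 1, 2)]]
print(all(implications(R) <= implications(Rp) for Rp in alternatives))  # -> True
```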
The following proposition establishes non-identifiability of most non-singleton sets. Specifically, every set with at least three orderings is not identified, and a pair of orderings is identified if and only if the orderings differ at the extremes (the maximal element in the first ordering is the minimal element in the second, and vice versa).

Proposition 7 If |R| ≥ 3, then R is not identified. If R = {r^1, r^2}, then R is identified if and only if argmax_i r^j_i = argmin_i r^{3−j}_i for j = 1, 2. Every singleton set R = {r} is identified.
The proof is deferred to the appendix. Proposition 7 leaves open the possibility of recovering an equivalence class (in choice implications) containing the true set of orderings. Say that R ∼ R′ if C(R) = C(R′), and let

[R] = { R′ ⊆ ℛ : R′ ∼ R }

be the equivalence class of R induced by ∼. Every set of orderings in [R] has the same choice implications as R, so an observation can be rationalized using an ordering in R if and only if it can be rationalized using an ordering in any R′ ∈ [R]. Say that R is choice-identified if there exists data D such that

[R] = argmin_{R′⊆ℛ} { |R′| : E(D, R′) = 0 }.
That is, there exists some set of choice observations D such that every observation in D is consistent with maximization of some ordering in R′ if and only if C(R′) = C(R).
The following negative example shows that even this weaker kind of identifiability
need not be satisfied by typical sets of orderings.
Example 13 Define

R = {(1, 2, 3, 4, 5), (2, 1, 3, 4, 5)}  and  R′ = {(1, 2, 3, 4, 5), (5, 4, 3, 2, 1)}.

Every choice observation consistent with maximization of some ordering in R is also consistent with maximization of some ordering in R′, but the converse does not hold. So C(R) ⊊ C(R′), implying both that R′ ∉ [R] and that every dataset that can be perfectly rationalized using the orderings in R can be perfectly rationalized using the orderings in R′. It follows that R is not choice-identified.
The example above highlights a general phenomenon: across sets with a fixed number of orderings, there is variation in the “richness" of choice implications. Since the approach in (3.2) penalizes all sets consisting of the same number of orderings equally, it is biased towards elicitation of sets with richer choice implications. That is, if the decision maker's context-dependent preferences are many but similar, the proposed approach will incorrectly interpret the data using orderings that are fewer but “more different". It may be possible to recover equivalence classes by extending the approach in (3.2) to loss functions of the form

E(D, R) + λ f(R),

where f penalizes the “richness" or “expressiveness" of the orderings in R, instead of their cardinality. I leave this extension for future work.
3.6 Relationship to Literature
This paper extends ideas in Kalai et al. (2002), which defines a set of orderings {≻_i}_{i=1}^L as a rationalization by multiple rationales of a choice function c : 𝒜 → X if for every choice set A ∈ 𝒜, the selected alternative c(A) is ≻_i-maximal in A for some i = 1, 2, . . . , L. Using the notation of Section 2, any set of orderings R with choice error E(D, R) = 0 is a rationalization by multiple rationales of the dataset D.
This set of orderings may not, however, correspond to a best multiple-ordering
rationalization of the data as defined in (3.2). In particular, I suggest that the analyst
may prefer an imperfect rationalization of the data using some K < L orderings to
perfect rationalization of the data using L orderings. The key conceptual difference is that Kalai et al. (2002) is agnostic towards the “degree of evidence" for any particular ordering ≻_k, whereas the approach in this paper insists on sufficient evidence for each ordering in order to separate error from preference variation.
The model of choice that I consider throughout is an extension of the frame-dependent preferences proposed independently in Salant and Rubinstein (2008) and Bernheim and Rangel (2009), with the addition of choice error. In each of these papers, the standard model is enriched by a set F of contexts.12 A choice problem13 is defined as a pair (A, f), where A ⊆ X is a choice set and f ∈ F is a context. An extended choice function c assigns to every extended choice problem (A, f) an element of A. I consider an extension of this model that allows for a probability of error, so that c(A, f) is chosen with probability at least 1 − p, but with probability p the agent trembles.

12Frames in Salant and Rubinstein (2008), and ancillary conditions in Bernheim and Rangel (2009).

13Extended choice problem in Salant and Rubinstein (2008), and generalized choice situation in Bernheim and Rangel (2009).

Although my model of choice is very similar, the goals of this paper are very different. Salant and Rubinstein (2008) characterizes the choice correspondence C_c(A) = {x : c(A, f) = x for some f ∈ F}, and Bernheim and Rangel (2009) proposes a framework for welfare assessment. This paper studies the question of
whether it is possible to recover the number of contexts in F, using choice data
alone. My results in Section 5 show that recovery of context-dependent preferences
given the model proposed in Salant and Rubinstein (2008) and Bernheim and
Rangel (2009) is an ill-posed problem, but recovery of the number of contexts is
possible even under choice error.
This paper is related also to Ambrus and Rozen (2013), which shows that
without prior restriction on the number of selves involved in a decision, many
multi-self models have no testable implications. Although the set of choice models
considered in Ambrus and Rozen (2013) is distinct from the set of choice models
considered in my paper,14 their lesson that restricting the number of selves is
important for constraining the available degrees of freedom holds in my domain
as well, and motivates in part the suggested criterion in (3.2).
Finally, the applications that I suggest are related to exercises undertaken in
Crawford and Pendakur (2012) and Dean and Martin (2010), which respectively
apply the approaches of Kalai et al. (2002) and Houtman and Maks (1985) to
interpret inconsistent choice data.
14Ambrus and Rozen (2013) study multi-self models in which every self is active in every decision, and choice is determined through maximization of a choice-set-independent aggregation rule over selves. In contrast, I study multi-self models in which every self acts as a “dictator" in a subset of choices, thus varying the aggregation rule across choice problems.

Conclusion

Inconsistencies in choice data may emerge either from choice error or from maximization of multiple orderings. It is important to separate the two in the analysis of the data, since their implications for welfare assessment and prediction are very
different. But how does the analyst know how many distinct orderings are being
maximized? This paper suggests use of statistical regularization to recover a small
number of context-dependent preferences from noisy choice data. I show that with
probability exponentially close to 1, the proposed approach is able to recover the
true number of context-dependent preferences. This provides an alternative to
existing approaches, which deliver either a single “best-fit" ordering or multiple
“perfect-fit" orderings.
References
Acemoglu, D., Chernozhukov, V. and Yildiz, M. (2015). Fragility of asymptotic agreement under Bayesian learning. Theoretical Economics.

Al-Najjar, N. (2009). Decision makers as statisticians: Diversity, ambiguity, and learning. Econometrica.

Ambrus, A. and Rozen, K. (2013). Rationalizing choice with multi-self models. Economic Journal.

Aumann, R. J. (1976). Agreeing to disagree. The Annals of Statistics.

Bar-Hillel, M. and Wagenaar, W. (1991). The perception of randomness. Advances in Applied Mathematics.

Barberis, N., Shleifer, A. and Vishny, R. (1998). A model of investor sentiment. Journal of Financial Economics.

Battigalli, P. and Siniscalchi, M. (2003). Rationalization and incomplete information. Advances in Theoretical Economics.

Belloni, A. and Chernozhukov, V. (2011). L1-penalized quantile regression in high-dimensional sparse models. Annals of Statistics.

—, —, Chen, D. and Hansen, C. (2012). Sparse models and methods for optimal instruments, with an application to eminent domain. Econometrica.

—, — and Wang, L. (2011). Square-root lasso: Pivotal recovery of sparse signals via conic programming. Biometrika.

Bernheim, B. D. and Rangel, A. (2009). Beyond revealed preference: Choice-theoretic foundations for behavioral welfare economics. Quarterly Journal of Economics.

Beshears, J., Choi, J., Laibson, D. and Madrian, B. (2008). The Importance of Default Options for Retirement Saving Outcomes: Evidence from the United States. Oxford University Press, pp. 59–87.

Billot, A., Gilboa, I., Samet, D. and Schmeidler, D. (2005). Probabilities as similarity-weighted frequencies. Econometrica.

Block, H. and Marschak, J. (1960). Random orderings and stochastic theories of response. Contributions to Probability and Statistics.

Brandenburger, A. and Dekel, E. (1993). Hierarchies of belief and common knowledge. Journal of Economic Theory.

Camerer, C. (1989). Does the basketball market believe in the ‘hot hand’? American Economic Review.

Carlin, B. I., Kogan, S. and Lowery, R. (2013). Trading complex assets. The Journal of Finance.

—, Longstaff, F. A. and Matoba, K. (2014). Disagreement and asset prices. Journal of Financial Economics.

Carlsson, H. and van Damme, E. (1993). Global games and equilibrium selection. Econometrica, 61 (5), 989–1018.

Chen, D., Moskowitz, T. and Shue, K. (). Decision-making under the gambler’s fallacy: Evidence from asylum judges, loan officers, and baseball umpires. Working paper.

Chen, Y.-C., di Tillio, A., Faingold, E. and Xiong, S. (2010). Uniform topologies on types. Theoretical Economics.

Crawford, I. and Pendakur, K. (2012). How many types are there? Economic Journal.

Cripps, M., Ely, J., Mailath, G. and Samuelson, L. (2008). Common learning. Econometrica.

Croson, R. and Sundali, J. (2005). The gambler’s fallacy and the hot hand: Empirical data from casinos. Journal of Risk and Uncertainty.

Dean, M. and Martin, D. (2010). How consistent are your choice data? Working paper.

Dekel, E., Fudenberg, D. and Levine, D. (2004). Learning to play Bayesian games. Games and Economic Behavior.

—, — and Morris, S. (2006). Topologies on types. Theoretical Economics.

—, — and — (2007). Interim correlated rationalizability. Theoretical Economics.

Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory.

Edwards, W. (). Probability learning in 1000 trials. Journal of Experimental Psychology.

Einav, L., Finkelstein, A., Pascu, I. and Cullen, M. (2012). How general are risk preferences? Choices under uncertainty in different domains. American Economic Review.

Elad, M. and Bruckstein, A. (2002). A generalized uncertainty principle and sparse representation. IEEE Transactions on Information Theory.

Esponda, I. (2013). Rationalizable conjectural equilibrium: A framework for robust predictions. Theoretical Economics.

Eyster, E. and Piccione, M. (2013). An approach to asset pricing under incomplete and diverse perceptions. Econometrica.

Falk, R. and Konold, C. (1997). Making sense of randomness: Implicit encoding as a basis for judgment. Psychological Review.

Fudenberg, D., Iijima, R. and Strzalecki, T. (2015). Stochastic choice and revealed perturbed utility. Econometrica.

—, Kreps, D. and Levine, D. (1988). On the robustness of equilibrium refinements. Journal of Economic Theory.

Gabaix, X. (2014). A sparsity-based model of bounded rationality, with application to basic consumer and equilibrium theory. Quarterly Journal of Economics.

Gayer, G., Gilboa, I. and Lieberman, O. (2007). Rule-based and case-based reasoning in housing prices. The B.E. Journal of Theoretical Economics.

Geanakoplos, J. and Polemarchakis, H. (1982). We can’t disagree forever. Journal of Economic Theory.

Gilboa, I., Lieberman, O. and Schmeidler, D. (2006). Empirical similarity. Review of Economics and Statistics.

—, Samuelson, L. and Schmeidler, D. (2013). Dynamics of inductive inference in a unified framework. Journal of Economic Theory.

— and Schmeidler, D. (2003). Inductive inference: An axiomatic approach. Econometrica.

Gilovich, T., Vallone, R. and Tversky, A. (1985). The hot hand in basketball: On the misperception of random sequences. Cognitive Psychology.

Hansen, L. P. (2014). Uncertainty inside and outside economic models. Journal of Political Economy.

— and Sargent, T. (2010). Fragile beliefs and the price of uncertainty. Quantitative Economics.

— and — (2012). Three types of ambiguity. Journal of Monetary Economics.

— and Sargent, T. J. (2007). Robustness. Princeton University Press.

Houtman, M. and Maks, J. (1985). Determining all maximal data subsets consistent with revealed preference. Kwantitatieve Methoden, 19, 89–104.

Huang, P.-S., Chen, S. D., Smaragdis, P. and Hasegawa-Johnson, M. (2013). Singing-voice separation from monaural recordings using robust principal component analysis. International Conference on Acoustics, Speech, and Signal Processing.

Huber, J., Payne, J. and Puto, C. (1982). Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research.

Jackson, M. O., Kalai, E. and Smorodinsky, R. (1999). Bayesian representation of stochastic processes under learning: De Finetti revisited. Econometrica.

Kahneman, D. and Tversky, A. (2000). Choices, Values, and Frames. The Press Syndicate of the University of Cambridge.

Kajii, A. and Morris, S. (1997). The robustness of equilibria to incomplete information. Econometrica.

Kalai, G., Rubinstein, A. and Spiegler, R. (2002). Rationalizing choice functions by multiple rationales. Econometrica, 70 (6), 2481–2488.

Kandel, E. and Pearson, N. (1995). Differential interpretation of information and trade in speculative markets. Journal of Political Economy.

Kohlberg, E. and Mertens, J.-F. (1986). On the strategic stability of equilibria. Econometrica.

Mankiw, G., Reis, R. and Wolfers, J. (2004). Disagreement about inflation expectations. NBER Macroeconomics Annual 2003.

Marlon, J., Leiserowitz, A. and Feinberg, G. (2013). Scientific and Public Perspectives on Climate Change. Tech. rep., Yale Project on Climate Change Communication.

Mertens, J.-F. and Zamir, S. (1985). Formulation of Bayesian analysis for games with incomplete information. International Journal of Game Theory.

Monderer, D. and Samet, D. (1989). Approximating common knowledge with common beliefs. Games and Economic Behavior.

Morris, S., Rob, R. and Shin, H. S. (1995). p-dominance and belief potential. Econometrica.

— and Takahashi, S. (). Strategic implications of almost common certainty of payoffs.

—, — and Tercieux, O. (2012). Robust rationalizability under almost common certainty of payoffs. The Japanese Economic Review.

Myerson, R. B. (1978). Refinements of the Nash equilibrium concept. International Journal of Game Theory.

Nickerson, R. and Butler, S. (2009). On producing random sequences. American Journal of Psychology.

Peysakhovich, A. and Naecker, J. (2016). Evaluating models of choice under risk and ambiguity using methods from machine learning. Working paper.

Rabin, M. (2002). Inference by believers in the law of small numbers. The Quarterly Journal of Economics.

— and Vayanos, D. (2010). The gambler’s and hot-hand fallacies: Theory and applications. Review of Economic Studies.

Rapoport, A. and Budescu, D. (1997). Randomization in individual choice behavior. Psychological Review.

Rath, G. (1966). Randomization by humans. The American Journal of Psychology.

Ren, X., Zhang, Z. and Ma, Y. (2012). Repairing sparse low-rank texture. Journal of LaTeX Class Files.

Rubinstein, A. (1989). The electronic mail game: Strategic behavior under “almost common knowledge". American Economic Review, 79.

Salant, Y. and Rubinstein, A. (2008). (A, f): Choice with frames. The Review of Economic Studies.

Selten, R. (1975). Reexamination of the perfectness concept for equilibrium points in extensive games. International Journal of Game Theory.

Steiner, J. and Stewart, C. (2008). Contagion through learning. Theoretical Economics.

Tversky, A. and Kahneman, D. (1971). The belief in the law of small numbers. Psychological Bulletin.

Wagenaar, W. (1972). Generation of random sequences by human subjects: A critical survey of the literature. Psychological Bulletin.

Weinstein, J. and Yildiz, M. (2007). A structure theorem for rationalizability with application to robust prediction of refinements. Econometrica.

Weigel, D. (2009). How many southern whites believe Obama was born in America? Washington Independent.

Yang, A., Zhou, Z., Ganesh, A., Sastry, S. S. and Ma, Y. (2013). Fast l1-minimization algorithms for robust face recognition. Computer Vision and Pattern Recognition.

Yu, J. (2011). Disagreement and return predictability of stock portfolios. Journal of Financial Economics.
Appendix A
Appendix to Chapter 1
A.1 Notation and Preliminaries
• If (X, d) is a metric space with A ⊆ X and x ∈ X, I write

d(A, x) = sup_{x′∈A} d(x′, x).

• Int(A) is used for the interior of the set A.

• Recall that u ∈ U is a payoff matrix. For clarity, I will sometimes write u_i to denote the payoffs in u corresponding to agent i, and u(a, θ) to denote g(θ)(a).

• For any μ, ν ∈ Δ(Θ), the Wasserstein distance is given by

W1(μ, ν) = inf E[d(X, Y)],

where the expectation is taken with respect to a Θ × Θ-valued random variable (X, Y), and the infimum is taken over all joint distributions of (X, Y) with marginals μ and ν respectively (a brief numerical illustration follows this list).
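As flagged in the last bullet, the coupling infimum in this definition has a closed form in one dimension, which allows a quick numerical illustration. The sketch below is hedged: it uses scipy's one-dimensional routine, which is external to the dissertation's setup.

```python
# Hedged one-dimensional illustration of the W1 definition above (Theta here
# is taken to be a compact subset of the real line for the example).
from scipy.stats import wasserstein_distance

mu_sample = [0.0, 1.0, 1.0]   # empirical mu: mass 1/3 at 0, 2/3 at 1
nu_sample = [0.0, 0.0, 1.0]   # empirical nu: mass 2/3 at 0, 1/3 at 1
print(wasserstein_distance(mu_sample, nu_sample))  # -> 0.333..., inf E d(X, Y)
```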
A.2 Preliminary Results
Lemma 4 The function

h(μ) = ∫_Θ g(θ) dμ,  μ ∈ Δ(Θ),

is continuous.

Proof 9 By assumption, g is Lipschitz continuous; let K < ∞ be its Lipschitz constant (assuming the sup-metric on U). Suppose dP(μ, μ′) ≤ ε; then

‖h(μ) − h(μ′)‖∞ = ‖ ∫_Θ g(θ) d(μ − μ′) ‖∞
  ≤ K sup_{f∈BL1(Θ)} | ∫_Θ f(θ) d(μ − μ′) |
  = K W1(μ, μ′)
  ≤ K (diam(Θ) + 1) dP(μ, μ′)
  ≤ K (diam(Θ) + 1) ε,

using the assumption of Lipschitz continuity in the first inequality, and compactness of Θ together with the Kantorovich–Rubinstein dual representation of W1 in the following equality. The second inequality is the standard comparison between the Wasserstein and Prokhorov metrics on a bounded metric space. So h is continuous.
Lemma 5 If dP(F_{Z_n}, δ_{θ*}) → 0 a.s., then also

dP(Conv(F_{Z_n}), δ_{θ*}) → 0 a.s.,

where Conv(F_{Z_n}) denotes the convex hull of F_{Z_n}.

Proof 10 Fix any dataset z_n, constant α ∈ [0, 1], and measures μ, μ′ ∈ F_{z_n}. Again using the dual representation,

W1(αμ + (1 − α)μ′, δ_{θ*}) = sup_{f∈BL1(Θ)} ∫ f(θ) d(αμ + (1 − α)μ′ − δ_{θ*})
  = sup_{f∈BL1(Θ)} [ α ∫ f(θ) d(μ − δ_{θ*}) + (1 − α) ∫ f(θ) d(μ′ − δ_{θ*}) ]
  ≤ α sup_{f∈BL1(Θ)} ∫ f(θ) d(μ − δ_{θ*}) + (1 − α) sup_{f∈BL1(Θ)} ∫ f(θ) d(μ′ − δ_{θ*})
  = α W1(μ, δ_{θ*}) + (1 − α) W1(μ′, δ_{θ*}) ≤ sup_{μ∈F_{z_n}} W1(μ, δ_{θ*}).

Moreover, the Prokhorov metric satisfies dP(·, ·)^2 ≤ W1(·, ·), so

dP(αμ + (1 − α)μ′, δ_{θ*})^2 ≤ W1(αμ + (1 − α)μ′, δ_{θ*}),

and also

sup_{μ∈F_{z_n}} W1(μ, δ_{θ*}) ≤ (1 + diam(Θ)) sup_{μ∈F_{z_n}} dP(μ, δ_{θ*}).

Thus, for every dataset z_n,

dP(Conv(F_{z_n}), δ_{θ*})^2 ≤ (1 + diam(Θ)) sup_{μ∈F_{z_n}} dP(μ, δ_{θ*}),

where diam(Θ) is finite by compactness of Θ. So dP(F_{Z_n}, δ_{θ*}) → 0 a.s. implies dP(Conv(F_{Z_n}), δ_{θ*}) → 0 a.s., as desired.
Claim 6 Fix any agent i, and let t_{θ*} be the type with common certainty in θ*. If action a_i is strongly strict-rationalizable for agent i with type t_{θ*}, then it is also weakly strict-rationalizable for agent i in the complete information game with payoffs u* = g(θ*).

Proof 11 By induction. Trivially R^1_j[t_{θ*}] = W^1_j = A_j for every agent j. If a_j ∉ W^2_j, then it is not a strict best response to any distribution over opponent actions, so also a_j ∉ R^2_j[t_{θ*}]. Thus,

R^2_j[t_{θ*}] ⊆ W^2_j for all j.

Now, suppose R^k_j[t_{θ*}] ⊆ W^k_j for every agent j, and consider any agent i and action a_i ∈ R^{k+1}_i[t_{θ*}]. By construction of the set R^{k+1}_i[t_{θ*}], there exists some distribution π with marg_{Θ×T_{−i}} π = κ_i(t_{θ*}) and π(a_{−i} ∈ R^k_{−i}[t_{−i}]) = 1 such that

∫_{Θ×T_{−i}×A_{−i}} u_i(a_i, a_{−i}, θ) dπ > ∫_{Θ×T_{−i}×A_{−i}} u_i(a′_i, a_{−i}, θ) dπ + δ for all a′_i ≠ a_i.

But since R^k_i[t_i] ⊆ W^k_i, the distribution π also satisfies π(a_{−i} ∈ W^k_{−i}) = 1. So a_i is a δ-best response to some distribution π with support in the surviving set of weakly strict-rationalizable actions, implying that a_i ∈ W^{k+1}_i, as desired.
A.3 Appendix C: Main Results
A.3.1 Proof of Claim 1
I use the following notation. For every dataset z_n = {(x_k, p(x_k))}_{k=1}^n, define

F(z_n) = { p̃(0) : p̃ ∈ P and p̃(x_k) = p(x_k) for all k = 1, . . . , n }

and let T_{z_n} be the set of hierarchies of belief with common certainty in F(z_n). (See footnote 8 for the definition of P.) Also, let t_{−1} be the type with common certainty in −1, and let t_1 be the type with common certainty in 1. Observe that R is rationalizable for type t_1 and not for type t_{−1}.

Suppose F(z_n) = {−1, 1}. Then t_{−1} ∈ T_{z_n}, so there is a type in T_{z_n} for whom R is not rationalizable. Now suppose F(z_n) = {1}. Then the only permitted type is t_1, so R is trivially rationalizable for every type in T_{z_n}. It follows that R is rationalizable for every type in T_{z_n} if and only if F(z_n) = {1}; that is, if and only if every inference rule p̃ ∈ P that exactly fits z_n predicts p̃(0) = 1. For what datasets z_n does this hold?
We can reduce this problem to asking whether the smallest hyper-rectangle that contains every successful observation also contains the origin. This will be the case if and only if for every dimension k, there exist observations (x_i, 1) and (x_j, 1) such that x_i^k < 0 and x_j^k > 0 (that is, the k-th attribute is negative in some observed high-yield region, and positive in some observed high-yield region). For every k, this probability is

1 − [ 2((2c − c′)/(2c))^n − ((c − c′)/c)^n ].

Realizations of the k-th attribute are independent across dimensions. Thus, the probability that this holds for every dimension is

( 1 − [ 2((2c − c′)/(2c))^n − ((c − c′)/c)^n ] )^r,

as desired.
A.3.2 Proof of Proposition 1
The proof of this proposition follows from two lemmas. The first is a straightforward generalization of Proposition 6 in Chen et al. (2010),1 and relates common learning to convergence of types in the uniform-weak topology. The second lemma says that for every dataset z, the distance between t_z^i and t_{θ*} is upper bounded by dP(F_z, δ_{θ*}).

Throughout, I use t_{θ*} to denote the type with common certainty in θ*.

Lemma 6 Agent i commonly learns θ* if and only if

dUW_i(t_{Z_n}^i, t_{θ*}) → 0 a.s. as n → ∞.

1This lemma appears in Chen et al. (2010) for the case in which Θ is a finite set and d_0 is the discrete metric, but it generalizes to any complete and separable metric space (Θ, d_0) when the definition of common learning is replaced by Definition 2.
Thus, the problem of determining whether an agent i commonly learns θ* is equivalent to that of determining whether his random type t_{Z_n}^i almost surely converges to t_{θ*} in the uniform-weak topology.

Lemma 7 For every dataset z,

dUW_i(t_z^i, t_{θ*}) ≤ dP(F_z, δ_{θ*}).    (A.1)
Proof 12 Fix any dataset z. It is useful to decompose the set of types T_z into the Cartesian product ∏_{k=1}^∞ H_z^k, where H_z^1 = F_z and for each k > 1, H_z^k is recursively defined by

H_z^k = { t^k ∈ T^k : (marg_{T^{k−1}} t^k)(H_z^{k−1}) = 1 and marg_Θ t^k ∈ H_z^1 };    (A.2)

that is, H_z^k consists of the k-th order beliefs of types in T_z. First, I show that every k-th order belief in the set H_z^k is within dP(F_z, δ_{θ*}) (in the d^k metric2) of the k-th order belief of t_{θ*}.

Claim 7 Define d* = dP(F_z, δ_{θ*}). For every k ≥ 1,

H_z^k ⊆ { t_{θ*}^k }^{d*} := { t^k ∈ T^k : d^k(t^k, t_{θ*}^k) ≤ d* }.

Proof 13 Fix any t ∈ T_z. By construction of T_z, the first-order belief of type t is in the set F_z. So it is immediate that

d^1(t, t_{θ*}) ≤ dP(F_z, δ_{θ*}) = d*.    (A.3)

Now suppose H_z^k ⊆ { t_{θ*}^k }^{d*}. Then, since t^{k+1}( { t_{θ*}^k }^{d*} ) ≥ t^{k+1}(H_z^k) = 1 from (A.2), and t_{θ*}^{k+1}( { t_{θ*}^k } ) = 1 by definition of the type t_{θ*}, it follows that

t_{θ*}^{k+1}(E) ≤ t^{k+1}(E^{d*}) + d*

for every measurable E ⊆ T^k. Using this and (A.3),

d^{k+1}(t, t_{θ*}) ≤ d*,    (A.4)

as desired.

So d^k(t, t_{θ*}) ≤ d* for every k, implying dUW_i(t, t_{θ*}) = sup_{k≥1} d^k(t, t_{θ*}) ≤ d*.

2See Section 3.2.
Thus, the question of convergence of types is reduced to the question of convergence of distributions over Θ. The remainder of the argument is now completed as follows.

Fix any map t^i : z ↦ t_z^i such that t_z^i ∈ T_z for every z. Suppose M is uniformly consistent; then sup_{μ∈M} dP(μ_{Z_n}, δ_{θ*}) → 0 a.s.3 It follows from Lemma 7 that

dUW_i(t_{Z_n}^i, t_{θ*}) → 0 a.s.,

so that agent i's (interim) type t_{Z_n}^i almost surely converges to t_{θ*}. Using Lemma 6, agent i commonly learns θ*.

For the other direction, suppose M is not uniformly consistent. Then, there exist constants ε, δ > 0 such that for n sufficiently large,

sup_{μ∈M} dP(μ(z_n), δ_{θ*}) > ε    (A.5)

for every z_n in a set Z*_n of P^n-measure δ. Define the map t^i such that for every dataset z_n ∈ Z*_n, agent i's first-order belief is μ(z_n) for some μ ∈ M satisfying dP(μ(z_n), δ_{θ*}) > ε (existence guaranteed by (A.5)). Then d^1(t_{Z_n}^i, t_{θ*}) does not converge to 0, so also dUW_i(t_{Z_n}^i, t_{θ*}) does not converge to 0, and it follows from Lemma 6 that agent i does not commonly learn θ*.

3Uniform convergence in W1 implies uniform convergence in the Prokhorov metric dP, since dP(·, ·)^2 ≤ W1(·, ·).
A.3.3 Proof of Claim 3
I prove this claim in two parts. Recall that U^{NE}_a is the set of all complete information games in which a is a Nash equilibrium. Thus, the set h^{−1}(U^{NE}_a) is the set of all distributions over Θ that induce an expected payoff in U^{NE}_a. The first claim says that δ_{θ*} ∈ Int(h^{−1}(U^{NE}_a)) if and only if h(F_{Z_n}) is almost surely contained in U^{NE}_a as the quantity of data n tends to infinity.

Claim 8 lim_{n→∞} h(F_{Z_n}) ⊆ U^{NE}_a a.s. if and only if δ_{θ*} ∈ Int(h^{−1}(U^{NE}_a)).

Proof 14 Sufficiency. Suppose δ_{θ*} ∈ Int(h^{−1}(U^{NE}_a)). Recall that under uniform consistency, W1(F_{Z_n}, δ_{θ*}) → 0 a.s., so that

lim_{n→∞} F_{Z_n} ⊆ V a.s.

for every open set V with δ_{θ*} ∈ V. This implies in particular that

lim_{n→∞} F_{Z_n} ⊆ h^{−1}(U^{NE}_a) a.s.

Using continuity of h (see Lemma 4), it follows from the continuous mapping theorem that

lim_{n→∞} h(F_{Z_n}) ⊆ U^{NE}_a a.s.,

as desired.

Necessity. Suppose δ_{θ*} ∉ Int(h^{−1}(U^{NE}_a)). Under assumption NI, there exists a constant δ > 0 independent of n, and a set Z*_n of measure δ, such that

δ_{θ*} ∈ Int(F_{z_n}) for all z_n ∈ Z*_n.

Consider any dataset z_n ∈ Z*_n. Since δ_{θ*} ∉ Int(h^{−1}(U^{NE}_a)), necessarily F_{z_n} ⊄ h^{−1}(U^{NE}_a). It follows that

lim_{n→∞} P^n( { z_n : h(F_{z_n}) ⊆ U^{NE}_a } ) < 1,

as desired.
Claim 9 δ_{θ*} ∈ Int(h^{−1}(U^{NE}_a)) if and only if u* ∈ Int(U^{NE}_a).

Proof 15 Suppose u* ∈ Int(U^{NE}_a). Then, there is an open set V such that

u* ∈ V ⊆ U^{NE}_a.

Since h is continuous (see Lemma 4), h^{−1}(V) is an open set in Δ(Θ). So

δ_{θ*} ∈ h^{−1}(V) ⊆ h^{−1}(U^{NE}_a),

implying that δ_{θ*} ∈ Int(h^{−1}(U^{NE}_a)), as desired.

For the other direction, suppose towards contradiction that δ_{θ*} ∈ Int(h^{−1}(U^{NE}_a)) but u* ∉ Int(U^{NE}_a). Since u* is on the boundary of U^{NE}_a, there exist some agent i and action a′_i ≠ a_i such that

u*_i(a′_i, a_{−i}) ≥ u*_i(a_i, a_{−i}).

Under assumption 4, g(Θ) has nonempty intersection with S(i, a_i), so there exists some θ ∈ g^{−1}(S(i, a_i)). For every ε > 0, define

μ_ε = (1 − ε)δ_{θ*} + εδ_θ.

The expected payoff under μ_ε satisfies

∫_U u_i(a′_i, a_{−i}) dg*(μ_ε) > ∫_U u_i(a_i, a_{−i}) dg*(μ_ε),

where g*(ν) denotes the pushforward measure of ν ∈ Δ(Θ) under the map g. So a_i is not a best response to a_{−i} given beliefs μ_ε over Θ, and therefore h(μ_ε) ∉ U^{NE}_a. This implies also μ_ε ∉ h^{−1}(U^{NE}_a). Thus the sequence μ_ε → δ_{θ*} has the property that μ_ε ∉ h^{−1}(U^{NE}_a) for every ε, so δ_{θ*} ∉ Int(h^{−1}(U^{NE}_a)), as desired.
A.3.4 Proof of Theorem 2
Only if: Define U^R_{a*_i} ⊆ U to consist of all payoffs u such that a*_i is rationalizable for player i in the complete information game with payoffs u.

Lemma 8 u ∈ Int(U^R_{a*_i}) if and only if a*_i survives every round of weak strict-rationalizability in the complete information game with payoffs u.

Proof 16 Only if: Suppose a*_i fails to survive some iteration of weak strict-rationalizability. Then, there exists a sequence of sets (W^k_j)_{k≥1} for every agent j satisfying the recursive description in Section 5.1, such that a*_i ∉ W^K_i for some K < ∞. To show that u ∉ Int(U^R_{a*_i}), I construct a sequence of payoff functions u^n with u^n → u (in the sup-metric) such that a*_i is not rationalizable in any complete information game with payoffs along this sequence, for n sufficiently large.

For every n ≥ 1, define the payoff function u^n as follows. For every agent j, let u^{n,1}_j satisfy

u^{n,1}_j(a_j, a_{−j}) = u_j(a_j, a_{−j}) + ε/n for all a_j ∈ W^1_j and all a_{−j} ∈ A_{−j},
u^{n,1}_j(a_j, a_{−j}) = u_j(a_j, a_{−j}) otherwise.

Recursively for k > 1, let u^{n,k}_j satisfy

u^{n,k}_j(a_j, a_{−j}) = u^{n,k−1}_j(a_j, a_{−j}) + ε/n for all a_j ∈ W^k_j and all a_{−j} ∈ A_{−j},
u^{n,k}_j(a_j, a_{−j}) = u^{n,k−1}_j(a_j, a_{−j}) otherwise.

Define u^n such that u^n_j := u^{n,K}_j for every player j.
I claim that a*_i is not rationalizable in the complete information game with payoff function u^n, for any n sufficiently large. To show this, let us construct for every player j the sets (S^{k,n}_j)_{k≥1} of actions surviving k rounds of iterated elimination of strictly dominated strategies given payoff function u^n, and show that for n sufficiently large, S^{k,n}_j = W^k_j for all k and every player j. I will use the following intermediate results.
Claim 10 There exists γ > 0 such that for any u′ satisfying ‖u′ − u‖∞ < γ, and for any agent j, if

u_j(a_j, a_{−j}) > max_{a′_j≠a_j} u_j(a′_j, a_{−j})

then

u′_j(a_j, a_{−j}) > max_{a′_j≠a_j} u′_j(a′_j, a_{−j}).

Proof 17 Let γ = (1/2) min_{i∈I} min_{a_i∈A_i} | u_i(a_i, a_{−i}) − max_{a′_i≠a_i} u_i(a′_i, a_{−i}) |, which exists by finiteness of I and of the action sets A_i. The claim follows immediately.

Corollary 5 Let N = εK/γ. Then, for every n ≥ N, if

u_j(a_j, a_{−j}) > max_{a′_j≠a_j} u_j(a′_j, a_{−j})

then

u^{n,k}_j(a_j, a_{−j}) > max_{a′_j≠a_j} u^{n,k}_j(a′_j, a_{−j})

for every k ≥ 1.

Proof 18 Follows directly from Claim 10, since for every j,

‖u^{n,k}_j − u_j‖∞ ≤ ‖u^n_j − u_j‖∞ ≤ εK/n

by construction.
The remainder of the proof proceeds by induction. Trivially, S^{0,n}_j = W^0_j = A_j for every j and n. Now consider any agent j and action a_j ∈ A_j. Suppose there exists some strategy α_{−j} ∈ Δ(A_{−j}) such that

u_j(a_j, α_{−j}) − max_{a′_j≠a_j} u_j(a′_j, α_{−j}) > 0,

so that a_j is a strict best response to α_{−j} under u. Then a_j ∈ W^1_j, and for n ≥ N, also a_j ∈ S^{1,n}_j (using Corollary 5). Suppose a_j is never a strict best response, but there exists α_{−j} ∈ Δ(A_{−j}) such that

u_j(a_j, α_{−j}) − max_{a′_j≠a_j} u_j(a′_j, α_{−j}) = 0.

If a_j ∈ W^1_j, then

u^n_j(a_j, α_{−j}) − max_{a′_j≠a_j} u^n_j(a′_j, α_{−j}) ≥ u_j(a_j, α_{−j}) − max_{a′_j≠a_j} u_j(a′_j, α_{−j}),

so also a_j ∈ S^{1,n}_j for n ≥ N. If a_j ∉ W^1_j, then for n ≥ N, there exists an action a′_j ≠ a_j such that u_j(a′_j, α_{−j}) = u_j(a_j, α_{−j}) but u^n_j(a′_j, α_{−j}) > u^n_j(a_j, α_{−j}), so a_j ∉ S^{1,n}_j. No other actions survive to either W^1_j or S^{1,n}_j. Thus S^{1,n}_j = W^1_j for all n ≥ N.

This argument can be repeated for arbitrary k. Suppose S^{k,n}_j = W^k_j for every j and n ≥ N, and consider any action a_j ∈ S^{k,n}_j. If there exists some strategy α_{−j} ∈ Δ(S^{k,n}_{−j}) such that

u_j(a_j, α_{−j}) − max_{a′_j≠a_j} u_j(a′_j, α_{−j}) > 0,

then a_j ∈ W^{k+1}_j, and for n ≥ N, also a_j ∈ S^{k+1,n}_j (using Corollary 5). Suppose a_j is not a strict best response to any α_{−j} ∈ Δ(S^{k,n}_{−j}), but there exists α_{−j} ∈ Δ(S^{k,n}_{−j}) such that

u_j(a_j, α_{−j}) − max_{a′_j≠a_j} u_j(a′_j, α_{−j}) = 0.

Then, if a_j ∈ W^{k+1}_j, action a_j is a strict best response to α_{−j} under u^n, so a_j ∈ S^{k+1,n}_j. Otherwise, if a_j ∉ W^{k+1}_j, then there exists some a′_j ∈ W^{k+1}_j such that u^n_j(a′_j, α_{−j}) > u^n_j(a_j, α_{−j}), so also a_j ∉ S^{k+1,n}_j. No other actions survive to either W^{k+1}_j or S^{k+1,n}_j, so S^{k+1,n}_j = W^{k+1}_j for n ≥ N. Therefore S^{k,n}_j = W^k_j for every k and n ≥ N, and in particular S^{K,n}_j = W^K_j for n ≥ N. Since a*_i ∉ W^K_i, also a*_i ∉ S^{∞,n}_i for n sufficiently large, as desired.
Finally, notice that by construction ‖u^n − u‖∞ ≤ εK/n, which can be rewritten as

‖u^{n(ε′)} − u‖∞ ≤ ε′,

where n(ε′) := εK/ε′. Thus, for every ε′ > 0, the payoff function u^{n(ε′)} ∈ B_{ε′}(u), but a*_i is not rationalizable in the complete information game with payoff function u^{n(ε′)}. So u ∉ Int(U^R_{a*_i}), as desired.

If: Suppose u ∉ Int(U^R_{a*_i}). Then there exists a sequence of payoff functions u^n → u such that a*_i is not rationalizable in the complete information game with payoffs u^n. Since action sets are finite, there is a finite number of possible orders of elimination. This implies existence of a subsequence along which the same order of iterated elimination of strategies removes a*_i. Choose any one-at-a-time iteration of this order of elimination. Then, a*_i fails to survive this order of elimination given the limiting payoffs u, so it is not weakly strict-rationalizable.
Next, I show that a*_i is robust to inference only if the true payoff function u* is in the interior of U^R_{a*_i}.

Lemma 9 a*_i is robust to inference only if u* ∈ Int(U^R_{a*_i}).

Proof 19 The following claim will be useful.

Claim 11 u* ∈ Int(U^R_{a*_i}) if and only if δ_{θ*} ∈ Int(h^{−1}(U^R_{a*_i})).

Proof 20 See the proof of Claim 9.

Suppose u* ∉ Int(U^R_{a*_i}); then, using Claim 11, also δ_{θ*} ∉ Int(h^{−1}(U^R_{a*_i})). Under assumption NI, there is a constant ε > 0 such that δ_{θ*} ∈ Int(F_{z_n}) for at least an ε-measure of datasets. Consider any such dataset. Then δ_{θ*} ∉ Int(h^{−1}(U^R_{a*_i})) implies that F_{z_n} ⊄ h^{−1}(U^R_{a*_i}). Fix any μ ∈ F_{z_n} \ h^{−1}(U^R_{a*_i}). Then a*_i is not rationalizable in the complete information game with payoffs h(μ), so it is also not rationalizable for the type with common certainty in h(μ).
If: If a*_i is strongly strict-rationalizable, then there exists a family of sets (V_j)_{j∈I} that is closed under δ-strict best reply for some δ > 0; that is, for every a_j ∈ V_j, there exists a distribution α_{−j} ∈ Δ(V_{−j}) such that

u*_j(a_j, α_{−j}) > max_{a′_j≠a_j} u*_j(a′_j, α_{−j}) + δ.

Recall the following fixed-point property of the set of rationalizable actions:

Lemma 10 (Dekel, Fudenberg and Morris, 2007) Fix any type profile (t_j)_{j∈I}. Consider any family of sets V_j ⊆ A_j such that every action a_j ∈ V_j is a best reply to a distribution π ∈ Δ(Θ × T_{−j} × A_{−j}) that satisfies marg_{Θ×T_{−j}} π = κ_j(t_j) and π(a_{−j} ∈ V_{−j}[t_{−j}]) = 1. Then, V_j ⊆ S^∞_j[t_j] for every agent j.

Fix any ε > 0. Then, for every agent j and type t_j with common certainty in B_ε(u*), we have that

∫ u_j(a_j, α_{−j}, θ) dκ_j(t_j) − max_{a′_j≠a_j} ∫ u_j(a′_j, α_{−j}, θ) dκ_j(t_j)
  ≥ inf_{u∈B_ε(u*)} ( u_j(a_j, α_{−j}) − max_{a′_j≠a_j} u_j(a′_j, α_{−j}) ) ≥ δ − 2ε,

which is positive for any ε < δ/2. So the family of sets (V_j)_{j∈I} satisfies the conditions in Lemma 10 when ε is sufficiently small, and it follows that a*_i ∈ S^∞_i[t_i], as desired.
A.3.5 Proof of Proposition 2
To simplify notation, set δ := δ^{NE}_{a*}. By assumption, δ > 0.

Lemma 11 B_{δ/2}(u*) ⊆ U^{NE}_{a*}.

Proof 21 Consider any payoff function u′ satisfying

‖u′ − u*‖∞ ≤ δ/2.    (A.6)

Then for every agent i,

u′_i(a*_i, a*_{−i}) − u′_i(a′_i, a*_{−i})
  = [ u′_i(a*_i, a*_{−i}) − u*_i(a*_i, a*_{−i}) ] + [ u*_i(a*_i, a*_{−i}) − u*_i(a′_i, a*_{−i}) ] + [ u*_i(a′_i, a*_{−i}) − u′_i(a′_i, a*_{−i}) ]
  ≥ −δ/2 + δ − δ/2 = 0,

where u*_i(a*_i, a*_{−i}) − u*_i(a′_i, a*_{−i}) > δ follows from the assumption that a* is a δ-strict NE in the complete information game with payoffs u*, and the other two bounds follow from (A.6). So a* is a NE in the complete information game with payoffs u′, implying that u′ ∈ U^{NE}_{a*}.

It follows from Lemma 2 that common certainty in B_{δ/2}(u*) is a sufficient condition for a* to be a Bayesian Nash equilibrium. Thus,

p^{NE}_n(a*) ≥ P^n( { z_n : h(F_{z_n}) ⊆ B_{δ/2}(u*) } )
  = P^n( { z_n : sup_{μ∈M} ‖h(μ_{z_n}) − u*‖∞ ≤ δ/2 } )
  = 1 − P^n( { z_n : sup_{μ∈M} ‖h(μ_{z_n}) − u*‖∞ > δ/2 } )
  ≥ 1 − (2/δ) E_{P^n}( sup_{μ∈M} ‖h(μ_{z_n}) − u*‖∞ ),

using Markov's inequality in the final line.
A.3.6 Proof of Proposition 3
To simplify notation, set δ := δ^R_{a*_i}. By assumption, δ > 0.

Lemma 12 B_{δ/2}(u*) ⊆ U^R_{a*_i}.

Proof 22 Consider any payoff function u′ satisfying

‖u′ − u*‖∞ ≤ δ/2.    (A.7)

By definition of δ^R_{a*_i}, there exists a family of sets (R_j)_{j∈I} with the property that for every agent j and action a_j ∈ R_j, there is a distribution α_{−j}[a_j] ∈ Δ(R_{−j}) satisfying

u*_j(a_j, α_{−j}[a_j]) > u*_j(a′_j, α_{−j}[a_j]) + δ for all a′_j ≠ a_j.    (A.8)

I will show that (R_j)_{j∈I} satisfies the conditions in Lemma 10 for any type profile (t_j)_{j∈I} in which every t_j has common certainty in B_{δ/2}(u*). Fix an arbitrary agent j, an action a_j ∈ R_j, and a type t_j with common certainty in B_{δ/2}(u*). Define the distribution π ∈ Δ(Θ × T_{−j} × A_{−j}) such that marg_{Θ×T_{−j}} π = κ_j(t_j) and marg_{A_{−j}} π = α_{−j}[a_j], noting that since α_{−j}[a_j] ∈ Δ(R_{−j}), this implies also that π(a_{−j} ∈ R_{−j}) = 1.

Since by assumption t_j has common certainty in B_{δ/2}(u*), the payoffs in the support of π are contained in B_{δ/2}(u*), so that for any action a_j,

| ∫ u_j(a_j, a_{−j}, θ) dπ − u*_j(a_j, α_{−j}[a_j]) | ≤ δ/2.    (A.9)

It follows that

∫ u_j(a_j, a_{−j}, θ) dπ − ∫ u_j(a′_j, a_{−j}, θ) dπ
  = [ ∫ u_j(a_j, a_{−j}, θ) dπ − u*_j(a_j, α_{−j}[a_j]) ] + [ u*_j(a_j, α_{−j}[a_j]) − u*_j(a′_j, α_{−j}[a_j]) ] + [ u*_j(a′_j, α_{−j}[a_j]) − ∫ u_j(a′_j, a_{−j}, θ) dπ ]
  ≥ −δ/2 + δ − δ/2 = 0,

using the inequalities in (A.8) and (A.9). It follows that a_j is a best response to α_{−j}[a_j] given distribution π. Repeating this argument for every agent j, action a_j ∈ R_j, and type t_j with common certainty in B_{δ/2}(u*), it follows from Lemma 10 that R_j ⊆ S^∞_j[t_j] for every agent j. Since a*_i ∈ R_i, also a*_i ∈ S^∞_i[t_i], as desired.

It follows from this lemma that h(F_z) ⊆ B_{δ/2}(u*) is a sufficient condition for a*_i to be rationalizable in every game in G(z). Thus,

p^R_n(i, a*_i) ≥ P^n( { z_n : h(F_{z_n}) ⊆ B_{δ/2}(u*) } )
  = P^n( { z_n : sup_{μ∈M} ‖h(μ_{z_n}) − u*‖∞ ≤ δ/2 } )
  = 1 − P^n( { z_n : sup_{μ∈M} ‖h(μ_{z_n}) − u*‖∞ > δ/2 } )
  ≥ 1 − (2/δ) E_{P^n}( sup_{μ∈M} ‖h(μ_{z_n}) − u*‖∞ ),

using Markov's inequality in the final line.
Proof of Corollary 4
From properties of the least-squares estimator,

E(|β̂_1 − β_1|^2) = Var(β̂_1) ≤ Σ_j Var(β̂_j)
  = σ^2 Σ_k E[ ((XᵀX)^{−1})_{kk} ]
  = σ^2 E[ tr((XᵀX)^{−1}) ]
  = σ^2 E( Σ_i λ_i^{−1} )
  ≤ σ^2 p/(√n − √p)^2,

where the final bound follows from Gordon's theorem for Gaussian matrices. Let K be the Lipschitz constant of the map g : Θ → U (assuming the sup-norm on U and the Euclidean norm on Θ); then

E( sup_{μ∈M} ‖h(μ_{Z_n}) − u*‖∞ ) ≤ K ( E(|β̂_1 − β_1|^2) + f_n^2 ) ≤ K ( σ^2 p/(√n − √p)^2 + f_n^2 ),

and the desired bound follows directly from Proposition 2.
A.3.7 Proof of Proposition 4
The argument below is for Nash equilibrium; the argument for rationalizability follows analogously. For every inference rule μ ∈ M, define

X^n_μ = 1( h(μ_{Z_n}) ∉ U^{NE}_a )

to take value 1 if the expected payoff under the (random) distribution μ(Z_n) is outside the set U^{NE}_a. Write F^n_μ for the marginal distribution of the random variable X^n_μ, and F^n_M for the joint distribution of the random variables (X^n_μ)_{μ∈M}. Enumerate the inference rules in M by μ_1, . . . , μ_K.

By Sklar's theorem, there exists a copula C : [0, 1]^K → [0, 1] such that

F^n_M(x_1, . . . , x_K) = C( F^n_{μ_1}(x_1), . . . , F^n_{μ_K}(x_K) )

for every x_1, . . . , x_K. Using the Fréchet–Hoeffding bound,

1 − K + Σ_{k=1}^K F^n_{μ_k}(x_k) ≤ C( F^n_{μ_1}(x_1), . . . , F^n_{μ_K}(x_K) ) ≤ min_{k∈{1,...,K}} F^n_{μ_k}(x_k).

From Lemma 2, p^{NE}_n(a) = F^n_M(0, . . . , 0). It follows that

1 − K + Σ_{k=1}^K F^n_{μ_k}(0) ≤ p^{NE}_n(a) ≤ min_{k∈{1,...,K}} F^n_{μ_k}(0).    (A.10)

Finally, since every X^n_μ ∼ Ber(1 − p^{NE}_{μ,n}), so that F^n_μ(0) = p^{NE}_{μ,n}, (A.10) implies

1 − Σ_{μ∈M} (1 − p^{NE}_{μ,n}) ≤ p^{NE}_n(a) ≤ min_{μ∈M} p^{NE}_{μ,n},

as desired.
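As a quick sanity check on how these bounds behave, the following hedged sketch evaluates the two Fréchet–Hoeffding endpoints for hypothetical marginal probabilities (the values are made up purely for illustration):

```python
# Hedged numeric illustration of the Frechet-Hoeffding bounds used above:
# whatever the dependence across inference rules, the probability that every
# rule's induced payoff lies in U^NE_a is sandwiched between these endpoints.
probs = [0.9, 0.8, 0.95]  # hypothetical marginals p^NE_{mu,n}
lower = max(0.0, 1 - sum(1 - q for q in probs))
upper = min(probs)
print(lower, upper)  # -> 0.65 0.8
```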
A.4 Appendix D: An example illustrating the fragility of
weak strict-rationalizability
In the following, I present a game in which an action is weakly strict-rationalizable,
but fails to be rationalizable along a sequence of perturbed types in the uniform-
weak topology.
Consider a game with four players. Each has two actions, a and b. Throughout
I will use, for example, abab to denote choice of a by players 1 and 3, and b by
players 2 and 4. Let payoffs be defined as follows. Player 1’s payoffs satisfy
u_1(a, xxx) = 1 if xxx = aaa or bbb, and 0 otherwise;
u_1(b, xxx) = 0 if xxx = aaa or bbb, and 1 otherwise.

That is, player 1 wants to play a if players 2–4 are all playing a or all playing b, and he wants to play b otherwise. The payoffs to players 2–4 are independent of player 1's action. They are described below (where rows correspond to player 2's actions, columns to player 3's, and the choice of matrix to player 4's), with player 1's payoffs omitted, so that the first coordinate corresponds to player 2's payoff:
         a          b                    a          b
  a   1, 1, 0    0, 0, 0          a   0, 0, 0    0, 0, 0
  b   0, 0, 0    0, 0, 0          b   0, 0, 0    1, 1, 0

        (a)                              (b)                (A.11)
That is, if player 4 chooses action a, then players 2 and 3 prefer coordination on a;
and if player 4 chooses b, then players 2 and 3 prefer coordination on b.
Let us first consider the case in which the true payoffs are common certainty, so
that this is a game of complete information (denote the payoffs by u). Then, a is
rationalizable for player 1. Not only is it rationalizable, but:
• there is a constant ε > 0 such that a is rationalizable for player 1 in every game u′ with ‖u′ − u‖∞ ≤ ε; that is, rationalizability is preserved on an open set of complete information games.

• a is weakly strict-rationalizable.

• although a is not strongly strict-rationalizable, it fails to survive this process only because none of player 4's actions survive the first round of elimination.4

Let t_1 be the type with common certainty in u. I will now show that there exists a sequence of types t^n_1 such that t^n_1 → t_1 in the uniform-weak topology, but a fails to be rationalizable for agent 1 infinitely many times along this sequence. The sequence of types t^n_1 will moreover have the property that every t^n_1 believes that an ε_n-neighborhood of u is common certainty, where ε_n → 0 as n → ∞.

4In particular, a is strongly strict-rationalizable in either game in which one of player 4's actions is dropped.
Define t^n_1 to satisfy two conditions. First, player 1 is certain5 that player 2 is certain that the payoffs in (A.11) are

           a              b                       a               b
  a   1, 1, −ε_n    0, 0, −ε_n          a   −ε_n, −ε_n, 0   0, −ε_n, 0
  b   0, 0, −ε_n    0, 0, −ε_n          b   0, 0, 0         1, 1, 0

        (a)                                  (b)                 (A.12)

and player 2 is certain, moreover, that player 4 is certain of these payoffs. Second, player 1 is certain that player 3 is certain that the payoffs in (A.11) are

           a              b                       a               b
  a   1, 1, 0        0, 0, 0             a   0, 0, −ε_n     0, 0, −ε_n
  b   −ε_n, −ε_n, 0  −ε_n, −ε_n, 0       b   0, 0, −ε_n     1, 1, −ε_n

        (a)                                  (b)                 (A.13)

and player 3 is certain, moreover, that player 4 is certain of these payoffs.

5Believes with probability 1.
Let us now consider the rationalizable actions for players 2 and 3. If player 4 is certain that payoffs are as in (A.12), then action b is his uniquely rationalizable action. So player 2, with the beliefs described above, believes with probability 1 that player 4 will play b. Since he is himself certain of the payoffs in (A.12), action b is his uniquely rationalizable action. By a similar argument, if player 4 is certain that payoffs are as in (A.13), then action a is uniquely rationalizable. So player 3, with the beliefs described above, believes with probability 1 that player 4 will play a, and thus considers a to be his uniquely rationalizable action as well.

So player 1 is certain that player 2 will play b and that player 3 will play a. It follows that his uniquely rationalizable action is b. Since this argument is valid for every ε_n > 0, action a is not rationalizable for player 1 of type t^n_1 for any n. But every t^n_1 believes that B_{ε_n}(u) is common certainty, so t^n_1 → t_1 in the uniform-weak topology.
Appendix B
Appendix to Chapter 2
B.1 Experiment Instructions
Subjects were presented with the following introduction screen:

[Figure: introduction screen shown to subjects]

Following a trial round and provision of consent, subjects were presented with 50 identical screens that looked like the following:

[Figure: string-completion task screen shown to subjects]

Subjects were given 30 seconds to complete each string, and a timer displayed their remaining time.
B.2 Behavioral Prediction Rules
Rabin prediction rules. Define the continuation rule

f_R(s_{1:7}) = p(0.5) + Σ_{k=0}^{6} p(1 − p)^k (0.5N − Σ_{j=7−k}^{7} s_j)/N

and the classification rule

c_R(s) = Σ_{r∈{0,1}^8} p^{Σ_i r_i} (1 − p)^{8−Σ_i r_i} q(s|r),

where

q(s|r) = ∏_{k=1}^{8} [ 0.5 r_k + (1 − r_k)(0.5N − Σ_{j=1}^{m_k} 1(s_{k−j} = s_k))/N ],  m_k = min{ j ≥ 1 : r_{k−j} = 1 },

is the probability that string s is generated when the urn is refreshed at every ‘1’ in r. There are two free parameters: p ∈ [0, 1] and N ∈ ℕ.
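For concreteness, a minimal sketch of the continuation rule f_R as transcribed above follows; the mixture weights mirror the displayed formula verbatim, so treat it as a transcription aid rather than a canonical statement of the Rabin (2002) model.

```python
# Hedged sketch of the continuation rule f_R as transcribed above.
def f_R(s, p, N):
    """Predicted probability that the 8th symbol is 1, given s = (s_1, ..., s_7)."""
    assert len(s) == 7
    pred = p * 0.5
    for k in range(7):                    # term for the k+1 most recent draws
        recent_ones = sum(s[6 - k:])      # s_{7-k} + ... + s_7 (1-indexed)
        pred += p * (1 - p) ** k * (0.5 * N - recent_ones) / N
    return pred

print(f_R((1, 1, 1, 0, 1, 1, 1), p=0.2, N=10))
```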
Gambler's fallacy prediction rules. Define the continuation rule

f_RV(s_{1:7}) = 0.5 − α Σ_{k=0}^{6} δ^k (2 s_{7−k} − 1)

and the classification rule

c_RV(s) = Σ_k [ s_k (0.5 − α Σ_{j<k} δ^{k−j} g(s_j)) + (1 − s_k)(0.5 + α Σ_{j<k} δ^{k−j} g(s_j)) ].

There are two free parameters: δ ∈ [0, 1] and α ∈ ℝ₊.
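A companion sketch of f_RV follows, under the assumption (not stated explicitly above) that g is the ±1 recoding g(s) = 2s − 1 used in f_RV.

```python
# Hedged sketch of the gambler's-fallacy continuation rule f_RV above;
# the recoding g(s) = 2s - 1 is an assumption where the text leaves g implicit.
def f_RV(s, alpha, delta):
    """Predicted probability that the 8th symbol is 1, given s = (s_1, ..., s_7)."""
    assert len(s) == 7
    return 0.5 - alpha * sum(delta ** k * (2 * s[6 - k] - 1) for k in range(7))

print(f_RV((1, 1, 1, 0, 1, 1, 1), alpha=0.1, delta=0.5))
```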
Appendix C
Appendix to Chapter 3
C.1 Proof of Theorem 1
C.1.1 Preliminary Notation and Results
I use the following objects and definitions. A hypergraph is a pair H = (V, E), where V is a finite nonempty set, called the set of vertices, and E is a family of distinct subsets of V, called the set of edges.1 A k-coloring of a hypergraph is a partition of its vertex set V into k color classes such that no edge in E is monochromatic. A hypergraph is k-colorable if it admits a k-coloring. Finally, G = (V, E) is a complete k-partite graph if there is a partition {V_i}_{i=1}^k of the vertex set V such that {u, v} ∈ E if and only if u and v lie in different partition elements. The set of all hypergraphs on M vertices is denoted ℋ. In the remainder of this proof, I refer to hypergraphs simply as graphs.

These concepts are related to our problem as follows. Enumerate the observations in any dataset D = {(x, A) : A ∈ 𝒜} as {(x_i, A_i)}_{i=1}^M. These choice observations can be identified with a graph H = (V, E), where V = {1, 2, . . . , M} indexes observations, and E consists of every set T ⊆ V such that: (1) the observations in {(x_i, A_i) : i ∈ T} are inconsistent2, and (2) no proper subset of {(x_i, A_i) : i ∈ T} is inconsistent. I refer to the vertices of H and the observations they represent interchangeably.

1This is a generalization of a graph in which edges may connect more than two vertices.

2There does not exist an ordering r such that x_i is r-maximal in A_i for every i ∈ T.
Claim 12 The following statements are equivalent:
1. H is k-colorable.
2. D is k-rationalizable.
Proof 23 Take each color class to represent consistency with a distinct ordering, and the
equivalence directly follows.
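To make the correspondence in Claim 12 concrete, the hedged brute-force sketch below (with hypothetical helper names) checks k-rationalizability directly on a toy dataset; by the claim, this is equivalent to k-colorability of the associated hypergraph H.

```python
# Hedged brute force over assignments of observations to k orderings
# ("color classes") on a small instance.
from itertools import permutations, product

def consistent(obs):
    """True if a single ordering makes each chosen x maximal in its choice set."""
    X = sorted({x for _, A in obs for x in A})
    return any(all(max(A, key=rank.__getitem__) == x for x, A in obs)
               for perm in permutations(X)
               for rank in [{a: i for i, a in enumerate(perm)}])

def k_rationalizable(D, k):
    return any(all(consistent([o for o, c in zip(D, colors) if c == j])
                   for j in range(k))
               for colors in product(range(k), repeat=len(D)))

# A 3-cycle of choices: x1 from {x1,x2}, x2 from {x2,x3}, x3 from {x1,x3}.
D = [(1, (1, 2)), (2, (2, 3)), (3, (1, 3))]
print(k_rationalizable(D, 1), k_rationalizable(D, 2))  # -> False True
```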
For any graph H, let f_H be the linear interpolation of the points {(k, Δ_{k,H}) : k ∈ ℤ₊}, where Δ_{k,H} is the minimal number of vertices that must be removed from H for it to become k-colorable.3 Let F_H be the convex hull of the epigraph of f_H (see Figure C.1), and define c := 1/λ. Then, if there does not exist k ∈ ℕ satisfying

Δ_{k,H} < Δ_{K,H} + c(K − k),4    (C.1)

the line h = { x : (−1, −λ) · x = (−1, −λ) · (K, Δ_{K,H}) } properly supports F_H at (K, Δ_{K,H}), and any solution R* to the minimization problem argmin_{R⊆ℛ} |R| + λE(D, R) must satisfy |R*| = K, as desired.

Finally, suppose that p = 0, so that the agent perfectly maximizes his context-dependent ordering. This identifies a deterministic graph G.

3This is equivalent to the definition of Δ_k used in the main text, through Claim 12.

4In vector notation, (−1, −λ) · (K − k, Δ_{k,H} − Δ_{K,H}) ≥ 0.
[Figure C.1: plot of f_H and the convex hull F_H of its epigraph, marking the points (k, Δ_{k,H}) and (K, Δ_{K,H}).]

Figure C.1: Any choice of λ for which (−1, −λ) is a subgradient of f_H at (K, Δ_{K,H}) will recover K. With high probability, the set of vectors { (−1, −1/((p + δ)M)) : δ ∈ (0, d(1 − p)^K/M − 2p − β) } is a subset of the subdifferential of f_H at (K, Δ_{K,H}).
Claim 13 G includes at least d non-overlapping complete K-partite subgraphs.5

Proof 24 Each subgraph induced by the vertices in a K-violation of IIA is a complete K-partite graph.

5Two subgraphs are said to be non-overlapping if they do not share vertices.

C.1.2 Main Proof

Imperfect maximization using the random choice rule P generates a probability distribution over ℋ. Denote the random graph with this distribution by H. Fix δ ∈ (0, d(1 − p)^K/M − 2p − β) and λ = 1/((p + δ)M). I will now show that the probability that no k ∈ ℕ satisfies (C.1) is at least 1 − O(c^{−M}), from which it will follow that the probability that Eq. (3.2) recovers K is at least 1 − O(c^{−M}).

In the subsequent claims, take S ∼ Bin(M, p) to be the number of observations which are imperfectly maximized, and take V_E ⊆ V to be the random variable whose outcome is the set of imperfectly maximized observations.
Lemma 13 The probability that no k > K satisfies (C.1) is at least 1 − e^{−2δ²M}.

Proof 25 Since Δ_{K,H} ≤ S, if there exists k > K such that Δ_{k,H} < Δ_{K,H} − c(k − K), then necessarily S > c = (p + δ)M; otherwise, Δ_{k,H} < c(K − k + 1) ≤ 0, which is impossible. Since E(S) = pM, it follows from Hoeffding's inequality that

Pr(S ≥ c) = Pr( S − pM ≥ c − pM ) ≤ exp( −2((p + δ)M − pM)²/M ) = e^{−2δ²M},

and therefore the probability that no k > K satisfies (C.1) is at least 1 − e^{−2δ²M}, as desired.
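The Hoeffding step above is easy to sanity-check numerically; the following hedged Monte Carlo sketch compares the empirical tail of S ∼ Bin(M, p) with the stated bound for one illustrative parameter choice.

```python
# Hedged Monte Carlo check of the Hoeffding step in Proof 25: the chance that
# S ~ Bin(M, p) reaches c = (p + delta) * M stays below exp(-2 * delta**2 * M).
import math, random

M, p, delta = 400, 0.05, 0.05
trials = 20000
hits = sum(sum(random.random() < p for _ in range(M)) >= (p + delta) * M
           for _ in range(trials))
print(hits / trials, "<=", math.exp(-2 * delta ** 2 * M))  # ~0.0 <= 0.135
```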
Lemma 14 The probability that no k < K satisfies (C.1) is at least

(1 − e^{−(β²/2)M}) [ 1 − exp( −(1 − p)^K β² M/(2(2p + δ + β)) ) ].

Proof 26 If there exists k < K satisfying (C.1), then H must include strictly fewer than c + S non-overlapping complete K-partite subgraphs. Otherwise, Δ_{k,H} ≥ (c + S)(K − k) ≥ Δ_{K,H} + c(K − k) for every k < K, since every complete K-partite graph is K-colorable and every subgraph of a complete K-partite graph is itself a complete K-partite graph.

Define H_P to be the random subgraph of H induced by the vertices in V \ V_E (the perfectly maximized observations). If H_P contains at least c + S non-overlapping complete K-partite subgraphs, then so does H. I therefore determine the probability that H_P includes at least c + S non-overlapping complete K-partite subgraphs as a lower bound.

I first show that S < (p + β/2)M with probability at least 1 − e^{−(β²/2)M}, and subsequently that conditional on the event { S < (p + β/2)M }, the subgraph H_P includes at least c + S non-overlapping complete K-partite subgraphs with probability at least 1 − exp( −(1 − p)^K β² M/(2(2p + δ + β)) ). The first statement follows immediately from Hoeffding's inequality, since E(S) = pM and

Pr( S < (p + β/2)M ) ≥ 1 − e^{−(β²/2)M}.    (C.2)

Suppose S < (p + β/2)M. Index the d complete K-partite subgraphs in G (existence from Claim 13) by i = 1, 2, . . . , d. Let X_i be the indicator variable which takes value 1 if every vertex in complete K-partite subgraph i is perfectly maximized, and let X = Σ_{i=1}^d X_i. Notice that X_i ∼ Ber((1 − p)^K) for every i and EX = d(1 − p)^K. Using Hoeffding's inequality,

Pr(X < c + S) = Pr( X − d(1 − p)^K < c + S − d(1 − p)^K ) ≤ exp( −2(c + S − d(1 − p)^K)²/d ).

Since by assumption δ ∈ (0, d(1 − p)^K/M − 2p − β), it follows that d ≥ (2p + β + δ)M/(1 − p)^K. Therefore d(1 − p)^K > (2p + δ + β)M > c + S, so

∂f(S)/∂d = −2(1 − p)^{2K} + 2(c + S)²/d² < 0  and  ∂f(S)/∂S = 4(1 − p)^K − 4(c + S)/d > 0,

where f(S) = −2(c + S − d(1 − p)^K)²/d. Then, using the upper bound on S and the lower bound on d,

exp( −2(c + S − d(1 − p)^K)²/d ) ≤ exp( −(1 − p)^K β² M/(2(2p + δ + β)) ).    (C.3)

From (C.2) and (C.3), the probability that no k < K satisfies (C.1) is at least

(1 − e^{−(β²/2)M}) [ 1 − exp( −(1 − p)^K β² M/(2(2p + δ + β)) ) ],

as desired.
Using Lemmas 13 and 14, the probability that no k ∈ ℤ₊ satisfies (C.1) is at least

(1 − e^{−(β²/2)M}) [ 1 − exp( −(1 − p)^K β² M/(2(2p + δ + β)) ) ] − exp(−2δ²M) = 1 − O(c^{−M}),

where c = min{ exp(β²/2), exp( (1 − p)^K β²/(2(2p + δ + β)) ), exp(2δ²) }.
C.2 Proof of Corollary 3

Take β = 0.1 and δ = 0.05. Since p ≤ 0.05, any d > (0.25/0.95^K)M satisfies d > ((2p + δ + β)/(1 − p)^K)M, so that

d(1 − p)^K/M − 2p − β > 0.25 − 2p − β ≥ 0.05

and δ = 0.05 ∈ (0, d(1 − p)^K/M − 2p − β). Directly apply Theorem 1.
C.3 Proof of Corollary 4

By assumption, 𝒫 includes at least d non-overlapping sets of choice problems in K-violation of IIA. Enumerate the Kd choice problems included in a K-violation by i = 1, . . . , Kd. Let Z_i be the random variable whose outcome is the number of times choice problem i is sampled, and let Q_i be the event { Z_i < α(M/(2^N K)) }, where α = (2p + β/2 + δ)/(2p + β + δ) < 1. Then,

Pr( ∪_{i=1}^{Kd} Q_i ) ≤ Σ_{i=1}^{Kd} Pr(Q_i) = Kd · Pr( Z_1 < α(M/(2^N K)) ).

Since Z_1 ∼ Bin(M, 1/(2^N K)) and E(Z_1) = M/(2^N K), it follows from Hoeffding's inequality that

Kd · Pr( Z_1 − M/(2^N K) < (α − 1)(M/(2^N K)) ) ≤ Kd · exp( −2[(1 − α)M/(2^N K)]²/M ) = Kd · exp( −(1 − α)²M/(2^{2N−1}K²) ) =: g(K, d, p, β, δ, M).    (C.4)

Therefore Pr( Z_i ≥ α(M/(2^N K)) for every i ) = 1 − Pr( ∪_{i=1}^{Kd} Q_i ) ≥ 1 − Kd · exp( −(1 − α)²M/(2^{2N−1}K²) ).

Conditional on the event { Z_i ≥ α(M/(2^N K)) for every i }, there are at least dα(M/(2^N K)) > ((2p + β/2 + δ)/(1 − p)^K)M non-overlapping sets of K choice problems in K-violation of IIA in the sampled data, and we can apply Theorem 1 to conclude that the probability of recovery has lower bound

f(K, d, p, β, δ, M) := (1 − e^{−(β²/8)M}) [ 1 − exp( −(1 − p)^K β² M/(8(2p + δ + β/2)) ) ] − exp(−2δ²M).    (C.5)

From (C.4) and (C.5), the probability of recovery is at least

(1 − g(K, d, p, β, δ, M)) · f(K, d, p, β, δ, M) = 1 − O(c^{−M}),

with

c := c(K, d, p, β, δ) = min{ exp(β²/8), exp( (1 − p)^K β²/(8(2p + δ + β/2)) ), exp(2δ²), exp( (1/(2^{2N−1}K²)) (1 − (2p + β/2 + δ)/(2p + β + δ))² ) } > 1,

as desired.
C.4 Proof of Proposition 7

For any set of orderings R and ordering r ∈ R, define g(r, R) to be the set of all choice observations consistent with maximization of r and inconsistent with maximization of any other r′ ∈ R.6 The set of revealed preferences in g(r, R) is given by the binary relation

B_r := { (x, y) : (x, A) ∈ g(r, R) for some A including y }.

Let B̄_r be its transitive closure. The following is a necessary condition for identifiability of R.

6For example, if R = {(1, 2, 3), (2, 3, 1)}, then g((1, 2, 3), R) = {(x3, {x1, x2, x3}), (x3, {x1, x3}), (x3, {x2, x3})}, since these observations are consistent with maximization of (1, 2, 3) and inconsistent with maximization of (2, 3, 1).

Claim 14 Suppose there exist orderings r, r′ ∈ R such that

argmax_i r_i ≠ argmin_i r′_i.    (C.6)

Then, R is not identified.

Proof 27 First, I show that R = {r^1, . . . , r^K} is identified only if B̄_r is complete for every r ∈ R. Suppose to the contrary that R is identified, but (without loss of generality) B̄_{r^1} is not complete. Then there exists some ordering r̃^1 ≠ r^1 such that every choice observation in g(r^1, R) is consistent with maximization of r̃^1, so we can replace r^1 with r̃^1 in R without loss of any choice implications. Formally, define R′ = {r̃^1, r^2, . . . , r^K}. Since C(R) ⊆ C(R′), it follows from Observation 4 that R is not identified.

Next, I show that if there exist orderings r, r′ ∈ R satisfying (C.6), then B̄_r is not complete for some r ∈ R. Index the alternatives such that x1 := argmax_i r_i and x2 := argmin_i r′_i. I show that neither (x1, x2) nor (x2, x1) is in B̄_{r′}, and hence B̄_{r′} is not complete. Suppose towards contradiction that (x1, x2) ∈ B̄_{r′}. Then (x1, A) ∈ g(r′, R) for some A ∈ 𝒜. But since x1 is r-maximal, every observation in which x1 is selected is consistent with maximization of ordering r. Thus, for every A ∈ 𝒜, the choice observation (x1, A) ∉ g(r′, R). This yields the desired contradiction. Suppose alternatively that (x2, x1) ∈ B̄_{r′}. Then (x2, A) ∈ g(r′, R) for some A ∈ 𝒜. But x2 is ranked last according to r′, so (x2, A) ∉ g(r′, R) for every A ∈ 𝒜. This yields the desired contradiction. Therefore, if there exist orderings r, r′ ∈ R satisfying (C.6), then R is not identified.

It follows immediately from Claim 14 that every set R = {r^1, r^2} with argmax_i r^j_i ≠ argmin_i r^{3−j}_i for some j ∈ {1, 2} is not identified. Moreover, since every set R with |R| ≥ 3 must include orderings satisfying (C.6), it follows from Claim 14 that every set R with three or more orderings is not identified.

Next, I show that sets R = {r^1, r^2} with argmax_i r^j_i = argmin_i r^{3−j}_i for j = 1, 2 are identified. Index the alternatives such that x1 = argmax_i r^1_i and xN = argmax_i r^2_i, and define D = C(r^1) ∪ C(r^2). Suppose to the contrary that there exists a set of orderings R′ = {r̃^1, r̃^2} ≠ R such that E(D, R′) = 0. I derive a contradiction by identifying an observation in D which is inconsistent with maximization of both r̃^1 and r̃^2.

First observe that necessarily either x1 is highest ranked in r̃^1 and xN is highest ranked in r̃^2, or vice versa, since (x1, X), (xN, X) ∈ D. Without loss of generality, suppose the former. Since r̃^1 ≠ r^1, there exist alternatives xk, xl with k, l ∉ {1, N} such that r^1_k > r^1_l and r̃^1_k < r̃^1_l; that is, xk is higher ranked than xl under r^1 but not under r̃^1. Let A be the set consisting of xk together with all alternatives ranked lower than xk in ordering r^1, noting that xN ∈ A since xN = argmin_i r^1_i by assumption. Then the choice observation (xk, A) ∈ C(r^1), but it is inconsistent with maximization of r̃^2 since xN ∈ A. Moreover, (xk, A) is inconsistent with maximization of r̃^1 since xl ∈ A. This yields the desired contradiction.

Finally, every singleton set R = {r} is trivially identified using the set of all of its choice implications, C(r).