Reinforcement Learning in Perfect-Information Games∗
Maxwell Pak†
Queen’s University
July 2006
Key words: reinforcement learning, extensive-form games
JEL Classification: D83
MSC2000 subject classification: primary: 91A18, 91A22; secondary: 68Q32
OR/MS subject classification: primary: games/group decisions
∗This paper could not have been written without the help and encouragement from my advisor Chris Shannon. Her suggestions have been indispensable for this work from its inception, and I owe her my deepest gratitude. I have also benefited greatly from many discussions with Botond Koszegi. I would like to thank Robert Anderson, Jim Bergin, Steve Goldman, Matthew Rabin, and Robert Powell for their helpful comments as well. Support from the NSF grants SES-9710424 and SES-9818759 is gratefully acknowledged.
†Department of Economics, Queen’s University, Kingston, Ontario K7L 3N6, Canada. Phone: 613.533.2251. Fax: 613.533.6668. E-mail: [email protected].
Abstract
This paper studies action-based reinforcement learning in finite perfect-information games. Restrictions on the valuation updating rule that guarantee that the play eventually converges to a subgame-perfect Nash equilibrium (SPNE) are identified. These conditions are mild enough to contain interesting and plausible learning behavior. We provide two examples of such updating rules that suggest that the extent of the knowledge and rationality assumptions needed to support a SPNE outcome in finite perfect-information games may be minimal.
1 Introduction
Most learning models in game theory feature strategy-based learning in which play-
ers attempt to learn their optimal strategies.1 While this may be reasonable in
simultaneous-move games, strategy-based learning models fail to provide a good de-
scriptive model of learning in extensive-form games because they require the players
to form complete contingent plans of actions at all of their decision nodes before
the game is played. For example, even in a relatively simple game like tic-tac-toe
in which each player makes at most four action choices, the game tree, ignoring
symmetry, contains roughly 9! nodes, and forming a complete strategy for the game
is beyond the ability of human players.
Since much of the real learning that takes place in extensive-form games appears to involve players who do not use complete strategies, this paper considers an action-based reinforcement learning model in which players make an action choice at a decision node only when that node is reached during the game play.2 In particular, we study a model in a setting where players repeatedly play a finite perfect-information game but treat each play myopically so that they are only concerned with the current period payoffs. When a decision node is reached during the game play, the player who moves at that node assigns valuations to her available actions and chooses the action with the highest valuation. The valuations are based on the payoff experience the player has had in the past periods in which that action had been played. We identify conditions on the valuation updating rule that lead to the players eventually playing a subgame-perfect Nash equilibrium (SPNE).
Our approach is similar to Jehiel and Samet [4] and Laslier and Walliser [7].
1For example, in fictitious play models players keep track of their opponents’ plays and choose a strategy that is a best response to the empirical frequency of their opponents’ plays. In replicator dynamic models, players adjust the frequency with which they play a pure strategy in proportion to the current payoff corresponding to that strategy.
2Good surveys of reinforcement learning in the machine learning literature are Sutton and Barto [9] and Kaelbling, Littman, and Moore [5]. The papers collected in Sutton [8] provide a more technical introduction to the reinforcement learning literature.
However, while these papers specify one particular valuation updating rule and show that the play converges to a SPNE in some appropriate sense, we provide a more general result: we identify restrictions on the valuation updating rules that guarantee convergence to a SPNE. As we show through examples, these restrictions are mild enough to encompass a wide range of learning behaviors.
Moreover, Jehiel and Samet employ an ε-greedy learning rule in which players take the action with the highest valuation with probability 1 − ε or explore by taking a random action with probability ε. However, whether the exploration is viewed as experimentation or a mistake, assuming that the exploration probability stays the same no matter how much experience a player has gained seems antithetical in a model of learning. A more plausible model of learning should reflect the fact that the rate at which a player experiments, or makes mistakes, when playing the same game for the millionth time is lower than when she has played it only for the first few times. In addition, a constant exploration probability is also undesirable in a prescriptive model of learning because it is difficult to know the appropriate value for the exploration probability. The desire to converge to near optimal behavior in the limit requires the probability of exploration to be set at a small value. However, exploring with small probability means that it will be difficult for the player to escape suboptimal behavior in the early plays of the game. Thus, good limiting behavior comes at the price of poor finite time performance.
To resolve the tension between the desired limit behavior and the desired finite
time behavior, a player should explore often in early periods and less frequently in
later periods. Here, we accomplish this by endogenizing the exploration probability.
In fact, in our model the players do not explicitly experiment. Rather, they always
take an action with the highest valuation but are nevertheless induced to explore
because of the imperfection in forming valuations.3 We show that if the valuation of
3We believe this approach to be more natural when the players, as in Jehiel and Samet [4] and Laslier and Walliser [7], are assumed to be myopic and treat each game as an end in itself. In such a setting, it is not clear why players would choose to experiment. Since the players are not concerned with future payoffs, there is no reason why players would be willing to sacrifice current payoff and take an action that they believe to be suboptimal.
an action becomes more accurate as, and only as, the number of times that action is
taken increases, then the player is induced to explore with ever decreasing frequency
but still infinitely often. This, in turn, leads the play to converge to a SPNE.
The remainder of the paper is organized as follows. Section 2 describes the reinforcement learning model considered here. Because endogenizing exploration comes with a technical burden, the main results in their full generality are discussed in Appendix A. Instead, Section 3 restricts attention to additively separable valuations that are the sum of two terms, an empirical term and an error term, and presents four conditions that together guarantee convergence to a SPNE. The two substantive conditions required for these valuations have intuitive interpretations. The first condition requires that the error term converge to zero as the player’s experience with that action grows. The second condition requires that if the fraction of time a player receives some payoff u when an action is chosen converges to one, then the empirical term converges to that payoff.
In Section 3.1, two examples of additively separable valuations, the sample averaging model and the more primitive, and more interesting, simple recollection model, are presented and shown to satisfy the four conditions. Section 3.2 presents the main results for additively separable valuations. The first result shows that if a valuation process satisfies the first substantive condition, then players explore every action infinitely often. The second result shows that if it satisfies both conditions, then the play converges to a SPNE outcome in probability and the fraction of time in which a SPNE outcome is played converges to one almost surely in finite perfect-information games with no relevant ties in the payoffs. The proofs of the results are deferred to Appendix A; instead, an intuition for the results is developed using the simple recollection model.
Finally, simulation results suggesting that the two examples considered here attain near-SPNE behavior relatively fast are given in Section 3.3. The simulation results also illustrate the tension inherent in the ε-greedy algorithm and show that the models considered in this paper can outperform the ε-greedy models in both finite time and the limit. The paper concludes in Section 4.
2 The Model
We consider a set of players who repeatedly play a finite perfect-information game G but treat each game myopically. Let G be the set of decision nodes of G, z0 the root node, and GT the set of terminal nodes. For each node z ∈ G\GT , i(z) denotes the player who moves at z and Az denotes the set of actions available at z. In abuse of notation, i(a) is used to denote the player to whom the action a belongs, so that i(a) = i(z), where a ∈ Az. Let Gz denote the subgame starting at z and let Gz denote the set of nodes in Gz. For each z ∈ GT , let ui(z) denote player i’s payoff from z. Let C̲i = min{ui(z) : z ∈ GT } and C̄i = max{ui(z) : z ∈ GT }. Let A be the set of all actions in G. For later convenience we add the “null action” a0, interpreted as the action immediately preceding the root node z0, to A. With this addition, we can define a bijection ζ : A → G that maps each action to the node that immediately succeeds it.
The players are assumed to play the game in the following sequential manner. Let T = {1, 2, 3, ...} be the time index. For each t ∈ T, the game in period t begins by player i(z0) choosing an action from Az0. Player i(z0) is assumed to have some valuation vt(a) for each action a ∈ Az0 and chooses an action a′ with the highest valuation. When there is a tie, the player chooses according to some arbitrary tie-breaking rule. Once action a′ has been chosen, the game proceeds to node z′ = ζ(a′) and player i(z′) moves next. Player i(z′) is also assumed to have some valuation vt(a) for each action a ∈ Az′ and chooses an action a′′ with the highest valuation. The game proceeds in this manner until a terminal node is reached and each player i receives her payoff, which is denoted uit. The outcome of the t-th play of the game is identified by the path, denoted ξt, that was taken during the play.4
4That is, the path ξt is the unique sequence of actions that starts from a0 and leads to the terminal node reached at the end of the t-th play of the game.
play evolves over time is governed by how the valuations are updated, and identifying the restrictions on the valuation updating rule, represented by the valuation process {vt(a) : t ∈ T} for each a ∈ A or collectively by {vt : t ∈ T}, that lead the game play to evolve towards a SPNE is the main goal of the paper.
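The sequential play just described can be sketched in code. The following is a minimal illustration, not from the paper: the two-stage game, its node names, and the `play_once` helper are our own assumptions, and the valuation function is left abstract (here assumed deterministic within a period).

```python
import random

# Hypothetical two-stage game: nonterminal node -> (mover, {action: successor}).
GAME = {"z0": (1, {"L1": "tL", "R1": "z1"}),
        "z1": (2, {"L2": "tA", "R2": "tB"})}
PAYOFFS = {"tL": (3, 0), "tA": (1, 2), "tB": (2, 1)}  # terminal node -> payoffs

def play_once(valuation, rng=None):
    """One period of play: at each node reached, the mover chooses an action
    with the highest current valuation (ties broken uniformly at random);
    returns the path xi_t and the payoff vector of the terminal node."""
    rng = rng or random.Random()
    node, path = "z0", []
    while node in GAME:                    # stop once a terminal node is hit
        _mover, actions = GAME[node]
        best = max(valuation(a) for a in actions)
        chosen = rng.choice([a for a in actions if valuation(a) == best])
        path.append(chosen)
        node = actions[chosen]
    return path, PAYOFFS[node]
```

For instance, a valuation that ranks R1 highest at the root and R2 highest at z1 produces the path ["R1", "R2"] with payoffs (2, 1). The updating rules studied below only change how `valuation` is formed between periods.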
3 Additively Separable Valuation Process
For valuations satisfying additive separability, the sufficient conditions that guarantee convergence to a SPNE are particularly intuitive and easily verified. Therefore, this section restricts attention to valuation processes such that for all a ∈ A and t ∈ T, vt(a) = ft(a) + gt(a), where ft(a), interpreted as the empirical term, has support inside [C̲i(a), C̄i(a)] and gt(a), interpreted as the error term, has full support.5 We show that if an additively separable valuation process satisfies the four assumptions listed below, then the play converges to a SPNE.

In the following, let F denote the σ-field generated by the underlying valuation updating rule, and let the sub-σ-field Ft denote the collection of events up to time t. For any event B, the indicator variable I(B) is the random variable taking the value 1 or 0 depending on whether event B has occurred or not. Let

Nt(a) = Σ_{k=1}^{t} I(a ∈ ξk)

denote the number of times action a has occurred up to time t.
Assumption (A1). For all a′, a′′ ∈ Az with a′ ≠ a′′, vt+1(a′) and vt+1(a′′) are conditionally independent given Ft. That is, for all Borel B′, B′′ ⊂ R,

P(vt+1(a′) ∈ B′ and vt+1(a′′) ∈ B′′ | Ft) = P(vt+1(a′) ∈ B′ | Ft) P(vt+1(a′′) ∈ B′′ | Ft)

almost surely.
Assumption (A2). For all a ∈ A, the conditional distribution of vt+1(a) changes only after action a has been sampled. That is, for any Borel B ⊂ R,

P(vt+2(a) ∈ B | Ft+1) = P(vt+1(a) ∈ B | Ft) on {a ∉ ξt+1}.

5The full support assumption is made for convenience. For further discussion, see general assumption (GA3) in Appendix A.
Assumption (A3). For all a ∈ A, the error term converges to zero as the number of times action a is taken increases. That is, for all ε > 0,

P(|gt+1(a)| > ε | Ft) → 0 on {Nt(a) → ∞}.

Assumption (A4). For all a ∈ A, whenever

(Σ_{n=1}^{t} I(a ∈ ξn) I(u_n^{i(a)} = u)) / Nt(a) → 1 as Nt(a) → ∞

for some constant u, then ft(a) → u as t → ∞.6
Assumptions (A1) and (A2) formalize how a valuation process for action-based learning should behave. Assumption (A1) requires that the valuations of different actions are independent when conditioned on the past history. Assumption (A2) requires that the conditional distribution of the valuation of an action changes only after that action has been taken. Assumption (A3) requires that the error term in the valuation of an action decreases to zero, in an appropriate probabilistic sense, as the player’s experience with that action grows. Lastly, to interpret assumption (A4), suppose that the fraction of time a player receives some payoff u when action a is taken converges to one. Assumption (A4) then requires that the empirical term corresponding to action a must also converge, again in an appropriate sense, to u.
3.1 Examples of Additively Separable Valuation Processes
For an immediate example of an additively separable valuation process satisfying assumptions (A1)-(A4), consider the following model of learning, which we call the
6Formally, ft(a) → u as t → ∞ on {(Σ_{n=1}^{t} I(a ∈ ξn) I(u_n^i = u)) / Nt(a) → 1 as Nt(a) → ∞}.
sample averaging model. When evaluating an action, players in this model try to
use the average of the past payoffs associated with the action. Players are assumed
to have imperfect ability to calculate the historical average, so the valuation assigned
to an action is a perturbed average of the payoffs received in the periods in which
that action had been taken; however, the error associated with evaluating an action
is assumed to decrease as the number of times that action has been taken by the
player increases.
For all a ∈ A, let {ε_t^a : t ∈ T} be independent copies of a random variable ε^a that has full support. The valuation process {vt : t ∈ T} for the model is given by the following:

vt(a) = (Σ_{n=1}^{t−1} I(a ∈ ξn) u_n^i) / Ñt−1(a) + ε_t^a / Ñt−1(a),

where i = i(a) and Ñt−1(a) = max{1, Nt−1(a)}. Proposition 1 shows formally that the sample averaging model satisfies assumptions (A1)-(A4).
Proposition 1. The sample averaging model satisfies assumptions (A1)-(A4).

Proof. It is easy to see that the sample averaging model satisfies (A1) and (A2). Fix any z′ ∈ GT . Letting

ft(a) = u_i(z′) I(Nt−1(a) = 0) / Ñt−1(a) + (Σ_{n=1}^{t−1} I(a ∈ ξn) u_n^i) / Ñt−1(a),

and

gt(a) = (ε_t^a − u_i(z′) I(Nt−1(a) = 0)) / Ñt−1(a),

where i = i(a), it is also easy to see that gt(a) has full support and that ft(a) ∈ [C̲i, C̄i] almost surely. For any ε > 0,

P(|gt+1(a)| > ε | Ft) = P(|ε_{t+1}^a − u_i(z′) I(Nt(a) = 0)| > ε Ñt(a) | Ft)
= P(|ε^a − u_i(z′) I(Nt(a) = 0)| > ε Ñt(a) | Ft)
→ 0 on {Nt(a) → ∞}.

So, (A3) is satisfied. Lastly, since

u_i(z′) I(Nt−1(a) = 0) / Ñt−1(a) → 0

on {Nt(a) → ∞}, (A4) is readily satisfied.
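As a concrete illustration, the sample averaging valuation can be coded directly from its formula. This sketch is ours, not the paper's: the class and method names are invented, and the Gaussian error term is one illustrative full-support choice for ε^a.

```python
import random

class SampleAveraging:
    """Sample averaging valuation: v_t(a) equals the sum of past payoffs
    received after taking a, divided by Ntilde(a) = max(1, N(a)), plus a
    full-support error term (Gaussian here) also divided by Ntilde(a)."""
    def __init__(self, sigma=10.0, rng=None):
        self.total = {}   # running sum of payoffs received after each action
        self.count = {}   # N(a): number of times each action has been taken
        self.sigma = sigma
        self.rng = rng or random.Random()

    def value(self, a):
        n_tilde = max(1, self.count.get(a, 0))
        # (sum of past payoffs + error) / Ntilde(a); the error shrinks with
        # experience exactly as required by assumption (A3)
        return (self.total.get(a, 0.0) + self.rng.gauss(0.0, self.sigma)) / n_tilde

    def update(self, a, payoff):
        self.total[a] = self.total.get(a, 0.0) + payoff
        self.count[a] = self.count.get(a, 0) + 1
```

Setting `sigma=0` recovers exact sample averaging: after observing payoffs 2 and 4 for an action, `value` returns their average, 3.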
For a more interesting example, consider the next model, which we call the simple recollection model. This model tries to capture the following primitive learning behavior. When evaluating an action, players try to remember what the payoffs had been in the previous periods in which that action had been chosen and assign one of the past payoffs as the value of that action. Naturally, the more a player receives a particular payoff after playing action a, the more likely it is that the value attached to action a is that particular payoff. Players are assumed to have imperfect memory, so that there is always a chance that players make an erroneous recall. However, the chance of making an error when evaluating an action decreases as the number of times that action has been taken by the player increases.
For all a ∈ A, let {η_t^a : t ∈ T} be independent copies of a uniform[0, 1] random variable, and let {ε_t^a : t ∈ T} be independent copies of an independent random variable ε^a that has full support. Let τ_k^a denote the period in which action a was chosen for the k-th time. The valuation process {vt : t ∈ T} is given formally by:

vt(a) = Σ_{k=1}^{Nt−1(a)} I(η_t^a ∈ (k/(1 + Nt−1(a)), (k + 1)/(1 + Nt−1(a))]) u_{τ_k^a}^i + I(η_t^a ≤ 1/(1 + Nt−1(a))) ε_t^a,

where i = i(a).
This process behaves as if there is an urn, or a memory bank, corresponding to
each action. Each urn initially contains one card, called the wild card. Suppose
node z is reached during the course of the t-th play. Player i(z) assigns a value vt(a)
to each a ∈ Az by drawing a card from the urn corresponding to action a. If a card
that is drawn is a wild card then the value assigned to the action is the outcome of
a draw from an independent random variable εat . If the card is not a wild card then
the value assigned to the action is the pre-recorded value on the card. In either case
the card is placed back into the urn once the value has been assigned to a.
If during the course of the t-th play action a was chosen, then at the end of the
t-th play player i(a)’s payoff in the t-th play is recorded on a new card and placed
into the urn for that action. Thus, each time action a is chosen, the number of
cards in the corresponding urn increases by one. Therefore, the chance of receiving
a random valuation declines as players gain experience but never disappears. On
the other hand, the distribution of vt(a) converges to the empirical distribution of
the payoffs received when action a was chosen as the number of the times in which
action a is chosen goes to infinity. For example, if player i receives the same payoff
u after choosing action a for all but finitely many times, then vt(a) converges to u
with probability one as t goes to infinity.
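The urn story translates almost line for line into code. This is our own sketch, not the paper's: the class name is invented and the Gaussian wild-card draw is one illustrative full-support choice for ε^a.

```python
import random

class SimpleRecollection:
    """Urn-based valuation: the urn for each action starts with one wild card;
    a draw returns a recorded past payoff, or a fresh full-support random
    value (Gaussian here) if the wild card comes up. Cards are drawn with
    replacement, and each play of the action adds one recorded-payoff card,
    so the wild card is drawn with probability 1/(1 + N(a))."""
    def __init__(self, sigma=10.0, rng=None):
        self.urn = {}                  # action -> list of recorded payoffs
        self.sigma = sigma
        self.rng = rng or random.Random()

    def value(self, a):
        cards = self.urn.get(a, [])
        k = self.rng.randrange(len(cards) + 1)      # the extra slot is the wild card
        if k == len(cards):
            return self.rng.gauss(0.0, self.sigma)  # erroneous (random) recall
        return cards[k]                             # recall a past payoff

    def update(self, a, payoff):
        self.urn.setdefault(a, []).append(payoff)   # record payoff on a new card
```

After an action has yielded payoff 5 a hundred times, a draw returns 5 with probability 100/101 and a wild-card value otherwise, matching the convergence of vt(a) to the empirical payoff distribution described above.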
Proposition 2. The simple recollection model satisfies assumptions (A1)-(A4).

Proof. It is easy to see that the simple recollection model satisfies (A1) and (A2). Fix any z′ ∈ GT . Letting

ft(a) = I(η_t^a ≤ 1/(1 + Nt−1(a))) u_i(z′) + Σ_{k=1}^{Nt−1(a)} I(η_t^a ∈ (k/(1 + Nt−1(a)), (k + 1)/(1 + Nt−1(a))]) u_{τ_k^a}^i,

and

gt(a) = I(η_t^a ≤ 1/(1 + Nt−1(a))) (ε_t^a − u_i(z′)),

where i = i(a), it is also easy to see that gt(a) has full support and that ft(a) ∈ [C̲i, C̄i] almost surely. For any ε > 0,

P(|gt+1(a)| > ε | Ft) = P(η_{t+1}^a ≤ 1/(1 + Nt(a)) and |ε_{t+1}^a − u_i(z′)| > ε | Ft)
= P(η^a ≤ 1/(1 + Nt(a)) | Ft) P(|ε^a − u_i(z′)| > ε | Ft)
→ 0 on {Nt(a) → ∞}.

So, (A3) is satisfied.

Lastly, on {(Σ_{n=1}^{t} I(a ∈ ξn) I(u_n^i = u)) / Nt(a) → 1 as Nt(a) → ∞},

I(η_t^a ≤ 1/(1 + Nt−1(a))) u_i(z′) → 0

and

Σ_{k=1}^{Nt−1(a)} I(η_t^a ∈ (k/(1 + Nt−1(a)), (k + 1)/(1 + Nt−1(a))]) u_{τ_k^a}^i → u

as t → ∞. So, (A4) is satisfied.
3.2 Main Results for Additively Separable Valuation Process
Theorem 1 shows that valuation processes satisfying assumptions (A1)-(A3) gen-
erate sufficient exploration by inducing players to play every action in the game
infinitely often with probability one. In particular, this means that SPNE paths oc-
cur infinitely often. Of course, this is not a strong result since the same theorem also
shows that every path occurs infinitely often as well. So, this theorem also serves
as a negative result showing that the probability of non-SPNE paths occurring only
finitely many times is zero.
Theorem 1. Let G be a finite perfect-information game. Suppose an additively separable valuation process {vt : t ∈ T} satisfies assumptions (A1)-(A3). Then for all a ∈ A, Nt(a) → ∞ almost surely as t → ∞.
Given the weak restriction placed on valuation processes by assumption (A3), it should not be surprising that further restrictions are needed to obtain positive convergence results. The additional restriction required is provided by assumption (A4).
Because non-SPNE paths occur infinitely often almost surely, it is clear that the
play cannot converge almost surely to a SPNE path. However, Theorem 2 shows
that the play can converge to a SPNE path in probability so that the occurrence of
non-SPNE paths becomes increasingly rare.
Since ties in the payoffs can present difficulties, the convergence results in Theorem 2 are limited to the following class of games. Let Γ be the collection of all finite perfect-information games such that for all z′, z′′ ∈ GT , ui(z′) = ui(z′′) if and only if uj(z′) = uj(z′′) for all j. So, Γ includes games with generic payoffs, which have no ties in the payoffs, and games where ties are irrelevant, like “win-lose-draw” games. Theorem 2 shows that if G ∈ Γ and the valuation process satisfies assumptions (A1)-(A4), then the probability of playing a SPNE path converges to one as the number of plays goes to infinity. In addition, the fraction of time in which a SPNE path is played converges to one almost surely.7

Theorem 2. Let G ∈ Γ. Suppose an additively separable valuation process {vt : t ∈ T} satisfies assumptions (A1)-(A4). Then the probability of playing a SPNE path converges to one, and the fraction of time a SPNE path of G is played converges to one almost surely as t → ∞.
Theorem 8 in the appendix shows that assumptions (A1)-(A4) are special cases of the general assumptions (GA1)-(GA4). Therefore, Theorem 1 and Theorem 2 follow from the corresponding theorems for general valuation processes, Theorems 3 and 6 in Appendix A. Rather than discuss the general theorems and proofs here, we defer their discussion to the appendix and instead develop intuition for the results using the simple recollection model.
Consider the 2-player game given in Figure 1. Since there are only two actions at the root node, one of the two actions, say action L1, must be played infinitely often. Since the valuation vt(L1) takes the random wild card value with probability 1/(1 + Nt−1(L1)), the probability of vt(L1) taking a wild card value goes to zero as the number of plays goes to infinity. So, the probability that vt(L1) = 3 converges to one.
7Moreover, as shown in Theorem 6, of which Theorem 2 is a special case, this is true for all subgames of G as well. That is, the probability of playing a SPNE path of Gz when node z is reached converges to one. Also, the ratio of the number of times in which a SPNE path of Gz is played to the number of times in which node z is reached converges to one almost surely. The players, therefore, learn the optimal action at every node and not just at nodes on a SPNE path.
[Game tree: player 1 moves at the root z0, choosing L1, which ends the game with payoffs (3, 7), or R1, which leads to node z1. Player 2 moves at z1, choosing L2 (leading to z2) or R2 (leading to z3). Player 1 moves at z2, where L3 yields (4, 9) and R3 yields (2, 6), and at z3, where L4 yields (1, 8) and R4 yields (0, 5).]

Figure 1: Example
Now, suppose action R1 occurs only finitely often, say M − 1 many times, so that time T is the last time action R1 occurs. Then, for all t > T , the probability that vt(R1) takes a random wild card value is at least 1/M. So, for all large t > T , the probability that vt(R1) is greater than vt(L1), which is at least P(vt(L1) = 3) × P(wild card is drawn at R1) × P(wild card at R1 > 3), is approximately P(ε^{R1} > 3)/M. Since this is true for every large t > T , it is not possible that the event {vt(R1) > vt(L1)} never occurs after time T , contrary to our assumption. Therefore, action R1 must occur infinitely often. Similar intuition shows that at each node, every action must occur infinitely often. Of course, this argument is not entirely correct since, among other things, T is random and the events considered here are not independent events. The proof of Theorem 3 provides a formal argument.
Next, the convergence of the play to the SPNE path can be demonstrated by an induction argument. Consider node z2. Since both actions L3 and R3 are played infinitely often, the probability that vt(L3) = 4 and vt(R3) = 2 converges to 1 as t → ∞. So, the probability that player 1 plays action L3 converges to one. Likewise, the probability that player 1 plays action L4 converges to one as well. So, the probability that player 2 receives payoff 9 after choosing L2 converges to one, which in turn makes the probability that vt(L2) = 9 converge to one. Likewise, the probability that vt(R2) = 8 converges to one as well. Therefore, the probability that player 2 chooses action L2 converges to one. Through induction, it is not hard to see that the probability that player 1 will choose R1 converges to one. Thus, the play converges to the SPNE of the game. Since not all the subgames are played at every period, complications arise in formalizing this induction argument. The proof of Theorem 6 solves this problem with the use of stopping times.
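The induction argument can also be watched numerically. The following rough, self-contained simulation runs the simple recollection model on the Figure 1 game (the tree encoding is our own; the N(0, 100) wild card mirrors the simulations of Section 3.3) and counts how often the SPNE path (R1, L2, L3) is played.

```python
import random

rng = random.Random(0)

# Figure 1: nonterminal node -> (mover, {action: successor}); terminal -> payoffs.
TREE = {"z0": (1, {"L1": "tL1", "R1": "z1"}),
        "z1": (2, {"L2": "z2", "R2": "z3"}),
        "z2": (1, {"L3": "tL3", "R3": "tR3"}),
        "z3": (1, {"L4": "tL4", "R4": "tR4"})}
PAY = {"tL1": (3, 7), "tL3": (4, 9), "tR3": (2, 6), "tL4": (1, 8), "tR4": (0, 5)}

urn = {a: [] for node in TREE for a in TREE[node][1]}  # action -> recorded payoffs

def value(a):
    """Simple recollection draw: wild card with probability 1/(1 + N(a))."""
    cards = urn[a]
    k = rng.randrange(len(cards) + 1)
    return rng.gauss(0.0, 10.0) if k == len(cards) else cards[k]

T, spne_count = 5000, 0
for t in range(T):
    node, taken = "z0", []
    while node in TREE:
        mover, actions = TREE[node]
        vals = {a: value(a) for a in actions}   # valuations drawn only when reached
        a = max(vals, key=vals.get)
        taken.append((a, mover))
        node = actions[a]
    payoffs = PAY[node]
    for a, mover in taken:                      # record each mover's own payoff
        urn[a].append(payoffs[mover - 1])
    if [a for a, _ in taken] == ["R1", "L2", "L3"]:
        spne_count += 1

print(f"SPNE path frequency over {T} plays: {spne_count / T:.2f}")
```

Because the wild-card probability at each action declines as its urn fills, the SPNE path frequency rises over time, while every action is still sampled infinitely often in the limit, as Theorem 1 requires.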
3.3 Finite Time Properties: Simulation Results
The theoretical results obtained here are results of asymptotic convergence. However, it is often argued that asymptotic results, while reassuring, have little practical relevance.8 In light of such views, this section concludes with simulation results that roughly assess the finite time properties of the two valuation processes. Specifically, 1000 copies of the simple 3-player game given in Figure 2 were generated, each with payoffs drawn randomly from a normal distribution with mean zero and variance 100. The learning behavior of the simple recollection model and the sample averaging model on each of these games, both with ε_t^a’s drawn from a normal distribution with mean zero and variance 100, was simulated and compared to the simulated learning behavior generated by the sample averaging ε-greedy models and the sample averaging greedy (ε = 0) model.
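For reference, the ε-greedy baselines differ from the models above only in the action-selection step: with probability ε the player ignores the valuations and picks a uniformly random action. A minimal sketch (the function name is ours):

```python
import random

def epsilon_greedy_choice(actions, valuation, eps, rng=None):
    """With probability eps explore uniformly at random; otherwise pick an
    action with the highest valuation (ties broken at random). eps = 0
    gives the greedy rule used as the GR baseline."""
    rng = rng or random.Random()
    actions = list(actions)
    if rng.random() < eps:
        return rng.choice(actions)              # constant-rate exploration
    best = max(valuation(a) for a in actions)
    return rng.choice([a for a in actions if valuation(a) == best])
```

Note that here the exploration rate is fixed for all t, which is exactly the feature the endogenous-exploration models of this paper avoid.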
For each simulation, the fraction of time the SPNE outcome had occurred up to the t-th play was tracked for each t ≤ 2000 in each of the learning models. The graph in Figure 3 presents the average of these ratios over the 1000 simulations for each learning model. The results suggest that the simple recollection model and the sample averaging model not only reach reasonable optimality relatively quickly but also begin to outperform the ε-greedy models quickly.
Since the simple recollection model starts with greater randomness than the ε-greedy
8See, for example, the remarks on the use of eventual convergence to optimal behavior in measuring learning performance in Kaelbling, Littman, and Moore [5]. As another example, Ellison [2] questions the relevance of the asymptotic convergence results of Kandori, Mailath, and Rob [6] since the time required before the system reaches its limit is beyond the time scale relevant for humans.
[Three-stage game tree: player 1 moves at the root, player 2 moves at each of the two second-stage nodes, and player 3 moves at each of the four third-stage nodes. Each of the eight terminal nodes k = 1, ..., 8 carries a payoff vector (u_1^k, u_2^k, u_3^k).]

Figure 2: Test Problem
[Average fraction of plays ending in a SPNE path (vertical axis, 0.1 to 0.9) against the number of plays (horizontal axis, 0 to 2000) for each model. SA: Sample Averaging; SR: Simple Recollection; EG 0.1: ε-Greedy with ε = 0.1; EG 0.01: ε-Greedy with ε = 0.01; GR: Greedy.]

Figure 3: Simulation Results
models, it initially performs worse than the ε-greedy models. However, whereas the ε-greedy models are forced to explore with the same probability no matter how much experience the players have gained, the simple recollection model allows the exploration probabilities to decline with experience, so it eventually outperforms the ε-greedy models. In these simulations, this occurs before the 400th play.
In addition, these simulations also illustrate the inherent tension between finite time and limit behavior present in ε-greedy models. As discussed before, while smaller values of ε lead to better limit behavior, they come at the price of poorer finite time performance. Indeed, the simulations show that while the ε-greedy model with ε = 0.01 will eventually have better limit behavior than the model with ε = 0.1, it performs worse for the first 2000 periods considered. The sample averaging model with endogenous exploration probability introduced in this paper resolves this tension and performs better in both finite time and the limit.
Lastly, to see the importance of maintaining exploration, consider the learning
curve corresponding to the greedy method. Because the greedy method starts out
with no randomness, the greedy method initially performs as well as or better than
the other methods. However, because it never explores, it is trapped in taking
suboptimal actions in the later plays.
4 Concluding Remarks
This paper studies sufficient conditions on the valuation updating rule that guarantee that the play converges to a subgame-perfect Nash equilibrium in finite perfect-information games with no relevant ties in the payoffs. The restrictions we identify are mild enough to contain interesting and plausible learning behavior. The two examples of such valuation updating rules we provide correspond to learning behaviors that are primitive but still satisfy the two basic principles of learning outlined in Erev and Roth [3]: the Law of Effect (Thorndike [10]) and the Power Law of Practice (Blackburn [1]), which assert, respectively, that the more often an action leads to a good outcome the more likely it is that this action will be chosen in the future, and that learning curves rise steeply initially but flatten out later. The very naivete of the learning behaviors associated with these models suggests that in this class of games, the extent of rationality that is needed to support a subgame-perfect Nash equilibrium may be minimal.
A General Valuation Process
This section presents the restrictions on general valuation updating rules that guarantee convergence to a SPNE. In particular, the additive separability assumption is dropped. Let (Ω, F, P) be the probability space on which a valuation process {v_t : t ∈ T} is defined. Let F_0 = {∅, Ω}, and let F_t = σ(v_1, ..., v_t) be the sub-σ-field consisting of events up to time t. For any action a ∈ A, let τ^a_n denote the time of the n-th occurrence of action a. That is, τ^a_0 = 0 and, ∀n ∈ T, τ^a_n = inf{t > τ^a_{n−1} : a ∈ ξ_t}. Then {τ^a_n : n ∈ Z_+} is a sequence of stopping times.

The following facts about stopping times are used throughout this section. For any stopping time τ, F_τ = {B ∈ F : ∀n, B ∩ {τ ≤ n} ∈ F_n} is a σ-field. Suppose τ_0 < τ_1 < τ_2 < ... almost surely; then {F_{τ_n} : n ∈ Z_+} is a filtration. Moreover, if {Y_t : t ∈ Z_+} is adapted to {F_t : t ∈ Z_+}, then {Y_{τ_n}} is adapted to {F_{τ_n} : n ∈ Z_+}, and if Y_t → Y almost surely as t → ∞, then Y_{τ_n} → Y almost surely as n → ∞.
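For concreteness, the occurrence times τ^a_n and the counts N_t(a) are straightforward to compute from a record of play paths. A minimal sketch (the action labels are hypothetical):

```python
def occurrence_times(paths, a):
    """tau^a_n for n = 1, 2, ...: the periods (1-indexed) at which action a occurs."""
    return [t for t, xi in enumerate(paths, start=1) if a in xi]

def visit_count(paths, a, t):
    """N_t(a): the number of periods up to and including t in which a was taken."""
    return sum(1 for xi in paths[:t] if a in xi)

paths = [{"L"}, {"R"}, {"L"}, {"L"}]   # a hypothetical record of play paths
print(occurrence_times(paths, "L"))    # -> [1, 3, 4]
print(visit_count(paths, "L", 3))      # -> 2
```

Note that τ^a_n is a stopping time precisely because deciding whether τ^a_n ≤ t requires only the play record up to time t, which is what the slice `paths[:t]` reflects.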
Each of the four conditions (GA1)-(GA4) below plays the same role as the corresponding assumption (A1)-(A4) for additively separable valuations. Assumptions (GA1) and (GA2) formalize how a valuation process for reinforcement learning should behave: (GA1) requires that valuations of actions are independent when conditioned on the past history, and (GA2) requires that the conditional distribution of the valuation of an action changes only after that action has been taken. The substantive conditions are assumptions (GA3) and (GA4): (GA3) guarantees that every action is taken infinitely often, and (GA4) provides the additional condition that guarantees that the play path converges to a SPNE path.
Assumption (GA1). For all a ∈ A, a′, a″ ∈ A_{ζ(a)} with a′ ≠ a″, and Borel sets B′, B″ ⊂ R,

    P(v_{τ^a_{n+1}}(a′) ∈ B′ and v_{τ^a_{n+1}}(a″) ∈ B″ | F_{τ^a_n}) = P(v_{τ^a_{n+1}}(a′) ∈ B′ | F_{τ^a_n}) P(v_{τ^a_{n+1}}(a″) ∈ B″ | F_{τ^a_n})

almost surely.
Assumption (GA2). For all a ∈ A, a′ ∈ A_{ζ(a)}, and Borel B ⊂ R,

    P(v_{τ^a_{n+2}}(a′) ∈ B | F_{τ^a_{n+1}}) = P(v_{τ^a_{n+1}}(a′) ∈ B | F_{τ^a_n}) on {a′ ∉ ξ_{τ^a_{n+1}}}.
In the following, let C = 1 + max{|u_i(z)| : i ∈ I, z ∈ G_T}, where I is the set of players of G.⁹
Assumption (GA3). For all a ∈ A and a′ ∈ A_{ζ(a)},

    {P(v_{τ^a_{n+1}}(a′) ≥ C | F_{τ^a_n}) → 0 as n → ∞} = {N_{τ^a_n}(a′) → ∞ as n → ∞}.
To interpret assumption (GA3), first note that {v_{τ^a_n}(a′) ≥ C} is the event that the valuation of action a′ is "overly optimistic," and {N_{τ^a_n}(a′) → ∞ as n → ∞} is the event that the number of times action a′ is taken goes to infinity. Therefore, assumption (GA3) essentially requires that the probability of a player having an overly optimistic valuation of an action declines to zero if and only if the number of times a′ is sampled goes to infinity.¹⁰
Assumption (GA4). For all a ∈ A, i = i(ζ(a)), and a′, a″ ∈ A_{ζ(a)}, if there exist constants u′ > u″ such that

    Σ_{n=1}^t I(a′ ∈ ξ_n) I(u^i_n = u′) / N_t(a′) → 1 a.s. as N_t(a′) → ∞

and

    Σ_{n=1}^t I(a″ ∈ ξ_n) I(u^i_n = u″) / N_t(a″) → 1 a.s. as N_t(a″) → ∞,

then

    P(v_{τ^a_{n+1}}(a′) > v_{τ^a_{n+1}}(a″) | F_{τ^a_n}) → 1 a.s. as n → ∞.
⁹It is not hard to see from the proof of Theorem 3 that C need not be this large. It is enough for C to be player i(ζ(a))'s highest possible payoff in the subgame G_{ζ(a)}. The particular choice of C made here simplifies notation and the proofs.
¹⁰A simpler way to state this requirement would have been the expression "P(v_{τ^a_n}(a′) ≥ C) → 0 if N_{τ^a_n}(a′) → ∞." However, because the events {v_{τ^a_n}(a′) ≥ C}, n ∈ T, are not independent, a conditional version of this idea, as stated in assumption (GA3), is needed.
To interpret assumption (GA4), suppose that the fraction of time a player receives some payoff u′ when an action a′ is taken and the fraction of time the player receives some payoff u″ when an action a″ is taken both converge to one, where u′ > u″. Assumption (GA4) then requires that the conditional probability of the player's valuation of action a′ being higher than the valuation of action a″ goes to one.
Theorem 3 shows that valuation processes satisfying assumptions (GA1)-(GA3) induce players to play every action in the game infinitely often with probability one.

Theorem 3. Let G be a finite perfect-information game. Suppose a valuation process {v_t : t ∈ T} satisfies assumptions (GA1)-(GA3). Then, ∀a ∈ A, N_t(a) → ∞ almost surely as t → ∞.
Proof. Take any a ∈ A, and let z = ζ(a). The proof proceeds by showing that if a ∈ ξ_t infinitely often a.s., then ∀a′ ∈ A_z, N_t(a′) → ∞ a.s. So, assume a ∈ ξ_t infinitely often a.s., so that τ^a_n < ∞ a.s. for all n. Since |A_z| < ∞, there must be at least one action in A_z that occurs infinitely often a.s. Suppose, towards a contradiction, that there exists a nonempty Â_z ⊂ A_z such that P(B) > 0, where B is the event {∀a′ ∈ Â_z, a′ ∈ ξ_t finitely often and ∀a′ ∈ A_z∖Â_z, a′ ∈ ξ_t infinitely often}.

Fix â ∈ Â_z, and let the constant C be as in assumption (GA3). Then there exists Y > 0 a.s. such that P(v_{τ^a_{n+1}}(â) ≥ C | F_{τ^a_n}) ≥ Y for all n on B. Moreover, P(v_{τ^a_{n+1}}(a′) ≥ C | F_{τ^a_n}) → 0 for all a′ ∈ A_z∖Â_z on B. Therefore, on B,

    Σ_{n=1}^∞ P(v_{τ^a_{n+1}}(â) > max_{a′ ∈ A_z∖Â_z} v_{τ^a_{n+1}}(a′) | F_{τ^a_n})
      ≥ Σ_{n=1}^∞ P(v_{τ^a_{n+1}}(â) ≥ C and ∀a′ ∈ A_z∖Â_z, v_{τ^a_{n+1}}(a′) < C | F_{τ^a_n})
      = Σ_{n=1}^∞ P(v_{τ^a_{n+1}}(â) ≥ C | F_{τ^a_n}) Π_{a′ ∈ A_z∖Â_z} P(v_{τ^a_{n+1}}(a′) < C | F_{τ^a_n})   (by conditional independence)
      ≥ Σ_{n=1}^∞ Y Π_{a′ ∈ A_z∖Â_z} (1 − P(v_{τ^a_{n+1}}(a′) ≥ C | F_{τ^a_n}))
      = ∞.

Hence, v_{τ^a_{n+1}}(â) > max_{a′ ∈ A_z∖Â_z} v_{τ^a_{n+1}}(a′) infinitely often by the conditional Borel-Cantelli lemma. However, max_{a′ ∈ A_z∖Â_z} v_{τ^a_{n+1}}(a′) > max_{a′ ∈ Â_z} v_{τ^a_{n+1}}(a′) for all but finitely many times on B. Since â ∈ Â_z, this is a contradiction. Thus, ∀a′ ∈ A_z, N_t(a′) → ∞ a.s. as t → ∞.

Finally, since a_0 is the null action, a_0 ∈ ξ_t infinitely often a.s. So the theorem follows by induction. □
The following two corollaries follow trivially from Theorem 3, so the proofs are omitted.

Corollary 4. If the assumptions of Theorem 3 are satisfied, every SPNE path occurs infinitely often with probability one.

Corollary 5. If the assumptions of Theorem 3 are satisfied, the probability that SPNE paths occur in all but finitely many periods is zero.
Recall that Γ is defined as the collection of all finite perfect-information games such that, ∀z′, z″ ∈ G_T, u_i(z′) = u_i(z″) if and only if u_j(z′) = u_j(z″) for all j. Theorem 6 shows that if G ∈ Γ and the valuation process satisfies assumptions (GA1)-(GA4), then the probability of playing a SPNE path converges to one as the number of plays goes to infinity. In addition, the fraction of time in which a SPNE path is played converges to one almost surely as the number of plays goes to infinity. Moreover, Theorem 6 shows that this is true for all subgames of G as well. That is, the probability of playing a SPNE path of G_z when node z is reached converges to one as the number of times z is reached goes to infinity. Also, the ratio of the number of times in which a SPNE path of G_z is played to the number of times in which node z is reached converges to one almost surely as the number of plays goes to infinity. The players, therefore, eventually learn the optimal action at every node and not just at nodes on a SPNE path.
In the following, if a ∈ ξ, let ξ^a = ξ ∩ {a′ ∈ A_{z′} : z′ ∈ G_{ζ(a)}} denote the continuation path of ξ. Let Ξ^a = {path ξ of G : a ∈ ξ and ξ^a is a SPNE path of G_{ζ(a)}}, and let S_t(a) = Σ_{n=1}^t I(ξ_n ∈ Ξ^a).

Theorem 6. Suppose G ∈ Γ and the valuation process {v_t : t ∈ T} satisfies assumptions (GA1)-(GA4). Then, for all a ∈ A, P(ξ_{τ^a_n} ∈ Ξ^a) → 1 as n → ∞, and S_t(a)/N_t(a) → 1 a.s. as t → ∞.¹¹
Proof. The proof proceeds by induction on subgames of G. Let L(G_z) denote the maximum length of a path in G_z. For all z ∈ G∖G_T, let Ã_z = {ã ∈ A_z : there exists a SPNE of G_z in which ã occurs with positive probability}.

Let a ∈ A be such that L(G_{ζ(a)}) = 1. Let z = ζ(a) and i = i(z). By Theorem 3, ∀a′ ∈ A_z, N_t(a′) → ∞ a.s. Since L(G_z) = 1, ζ(a′) ∈ G_T for all a′ ∈ A_z, so I(a′ ∈ ξ_t) I(u^i_t = u^i(ζ(a′))) = I(a′ ∈ ξ_t) a.s. Hence,

    Σ_{n=1}^t I(a′ ∈ ξ_n) I(u^i_n = u^i(ζ(a′))) / N_t(a′) = Σ_{n=1}^t I(a′ ∈ ξ_n) / N_t(a′) = 1 a.s. for all t ∈ T.

Let ã ∈ Ã_z. Then, a.s.,

    1 ≥ E[I(ξ_{τ^a_{n+1}} ∈ Ξ^a) | F_{τ^a_n}]
      = P(ξ_{τ^a_{n+1}} ∈ Ξ^a | F_{τ^a_n})
      ≥ P(v_{τ^a_{n+1}}(ã) > v_{τ^a_{n+1}}(a′) for all a′ ∈ A_z∖Ã_z | F_{τ^a_n})
      = Π_{a′ ∈ A_z∖Ã_z} P(v_{τ^a_{n+1}}(ã) > v_{τ^a_{n+1}}(a′) | F_{τ^a_n})   (by conditional independence)
      → 1 as n → ∞ by assumption (GA4).

Therefore, E[I(ξ_{τ^a_{n+1}} ∈ Ξ^a) | F_{τ^a_n}] → 1 a.s. as n → ∞, and P(ξ_{τ^a_n} ∈ Ξ^a) → 1 as n → ∞ by the dominated convergence theorem. Next,

    Σ_{k=1}^n E[I(ξ_{τ^a_{k+1}} ∈ Ξ^a) | F_{τ^a_k}] / n → 1 a.s. as n → ∞ by Lemma B.1,

and

    Σ_{k=1}^n (I(ξ_{τ^a_{k+1}} ∈ Ξ^a) − E[I(ξ_{τ^a_{k+1}} ∈ Ξ^a) | F_{τ^a_k}]) / n → 0 a.s. as n → ∞ by Lemma B.2.

    ⟹ Σ_{k=1}^n I(ξ_{τ^a_k} ∈ Ξ^a) / n → 1 a.s. as n → ∞.

    ⟹ S_t(a)/N_t(a) = Σ_{n=1}^t I(ξ_n ∈ Ξ^a) / N_t(a) = Σ_{k=1}^{N_t(a)} I(ξ_{τ^a_k} ∈ Ξ^a) / N_t(a) → 1 a.s. as t → ∞.

¹¹The first version of this theorem showed that S_t(a)/N_t(a) → 1 in expectation. It was not until I saw the use of a stability theorem in Jehiel and Samet [4] that I realized that the proof can be easily strengthened to show almost sure convergence. I would like to thank them for their insight. The version of the stability theorem used is stated and proved as Lemma B.2.
Next, suppose that for all subgames G_{ζ(a)} of G such that L(G_{ζ(a)}) ≤ m, E[I(ξ_{τ^a_{n+1}} ∈ Ξ^a) | F_{τ^a_n}] → 1 a.s. as n → ∞ and S_t(a)/N_t(a) → 1 a.s. as t → ∞. Let a ∈ A be such that L(G_{ζ(a)}) = m + 1. Let z = ζ(a) and i = i(z). By Theorem 3, ∀a′ ∈ A_z, N_t(a′) → ∞ a.s. By the induction hypothesis, S_t(a″)/N_t(a″) → 1 a.s. for all a″ ∈ A_z. Then, for any ã ∈ Ã_z, a.s.,

    1 ≥ P(v_{τ^a_{n+1}}(ã) > v_{τ^a_{n+1}}(a′) for all a′ ∈ A_z∖Ã_z | F_{τ^a_n})
      = Π_{a′ ∈ A_z∖Ã_z} P(v_{τ^a_{n+1}}(ã) > v_{τ^a_{n+1}}(a′) | F_{τ^a_n})   (by conditional independence)
      → 1 as n → ∞ by assumption (GA4).

Therefore, a.s.,

    1 ≥ E[I(ξ_{τ^a_{n+1}} ∈ Ξ^a) | F_{τ^a_n}]
      = E[ Σ_{ã ∈ Ã_z} I(v_{τ^a_{n+1}}(ã) = max_{a′ ∈ A_z} v_{τ^a_{n+1}}(a′)) I(ξ_{τ^a_{n+1}} ∈ Ξ^ã) | F_{τ^a_n} ]
      = E[ Σ_{ã ∈ Ã_z} I(v_{τ^a_{n+1}}(ã) = max_{a′ ∈ A_z} v_{τ^a_{n+1}}(a′)) Σ_{ã ∈ Ã_z} I(ξ_{τ^a_{n+1}} ∈ Ξ^ã) | F_{τ^a_n} ]
      = P(max_{ã ∈ Ã_z} v_{τ^a_{n+1}}(ã) > max_{a′ ∈ A_z∖Ã_z} v_{τ^a_{n+1}}(a′) | F_{τ^a_n}) Σ_{ã ∈ Ã_z} P(ξ_{τ^a_{n+1}} ∈ Ξ^ã | F_{τ^a_n})
      → 1 as n → ∞ by the induction hypothesis.

So, E[I(ξ_{τ^a_{n+1}} ∈ Ξ^a) | F_{τ^a_n}] → 1 a.s. as n → ∞. Then P(ξ_{τ^a_n} ∈ Ξ^a) → 1 as n → ∞, and S_t(a)/N_t(a) → 1 a.s. as t → ∞, by the same argument as before. □
Corollary 7. Let G be a finite perfect-information game with generic payoffs, and let the valuation process {v_t : t ∈ T} satisfy assumptions (GA1)-(GA4). Then the probability of playing the SPNE path converges to one, and the fraction of time the SPNE path of G is played converges to one almost surely as t → ∞.

Proof. Since G has a unique SPNE path, the corollary follows trivially from Theorem 6. □
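As an illustration of the convergence described by Theorem 6, the sketch below simulates learning in a small hypothetical two-stage game: player 1 chooses L or R; after R, player 2 chooses l or r; payoffs are chosen so that the unique SPNE path is (R, r). The updating rule, sample averaging with exploration probability declining in experience, is only one concrete rule in the spirit of assumptions (GA1)-(GA4), not the paper's exact model.

```python
import random

def learn(plays=5000, seed=1):
    """Fraction of the last half of plays that follow the SPNE path (R, r)."""
    rng = random.Random(seed)
    # hypothetical payoffs: L -> (1, 0); (R, l) -> (0, 0); (R, r) -> (2, 1)
    u1 = {("L",): 1.0, ("R", "l"): 0.0, ("R", "r"): 2.0}
    u2 = {("L",): 0.0, ("R", "l"): 0.0, ("R", "r"): 1.0}
    q1, n1 = {"L": 0.0, "R": 0.0}, {"L": 0, "R": 0}   # player 1's node
    q2, n2 = {"l": 0.0, "r": 0.0}, {"l": 0, "r": 0}   # player 2's node

    def pick(q, n):
        eps = 1.0 / (1 + min(n.values()))   # exploration declines with experience
        if rng.random() < eps:
            return rng.choice(sorted(q))    # explore uniformly
        return max(q, key=q.get)            # exploit highest valuation

    spne = 0
    for t in range(plays):
        a1 = pick(q1, n1)
        path = (a1,) if a1 == "L" else ("R", pick(q2, n2))
        if len(path) == 2:                  # player 2 moved: update her valuation
            a2 = path[1]
            n2[a2] += 1
            q2[a2] += (u2[path] - q2[a2]) / n2[a2]
        n1[a1] += 1
        q1[a1] += (u1[path] - q1[a1]) / n1[a1]
        if path == ("R", "r") and t >= plays // 2:
            spne += 1
    return spne / (plays - plays // 2)

print(learn())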
We close with a theorem showing that additive separability and assumptions (A1)-(A4) are a special case of the general assumptions (GA1)-(GA4).

Theorem 8. Suppose an additively separable valuation process {v_t : t ∈ T} satisfies assumptions (A1)-(A3). Then the valuation process satisfies the general assumptions (GA1)-(GA3). If, in addition, the process satisfies assumption (A4), then it also satisfies general assumption (GA4).
Proof. Let a = a_0. Since a is the null action, τ^a_n < ∞ a.s. Then, by (A1) and the strong Markov property, for all a′, a″ ∈ A_{ζ(a)} with a′ ≠ a″, condition (GA1) is satisfied. Likewise, by (A2) and the strong Markov property, for all a′ ∈ A_{ζ(a)}, condition (GA2) is satisfied. Let i = i(ζ(a)). For all a′ ∈ A_{ζ(a)},

    P(v_{τ^a_{n+1}}(a′) ≥ C | F_{τ^a_n})
      ≤ P(f_{τ^a_{n+1}}(a′) ∈ [C̲_i, C̄_i] and g_{τ^a_{n+1}}(a′) ≥ C − C̄_i | F_{τ^a_n})
      = P(f_{τ^a_{n+1}}(a′) ∈ [C̲_i, C̄_i] | F_{τ^a_n}) P(g_{τ^a_{n+1}}(a′) ≥ C − C̄_i | F_{τ^a_n})
      = P(g_{τ^a_{n+1}}(a′) ≥ C − C̄_i | F_{τ^a_n})
      → 0 as n → ∞ on {N_{τ^a_n}(a′) → ∞ as n → ∞}

by (A3) and the strong Markov property. Conversely, on {P(v_{τ^a_{n+1}}(a′) ≥ C | F_{τ^a_n}) → 0}, N_{τ^a_n}(a′) → ∞, since the conditional distribution of v_{τ^a_{n+1}}(a′) differs from that of v_{τ^a_n}(a′) only if a′ has been taken in the interim. Therefore, condition (GA3) is satisfied.

The induction step in the proof of Theorem 3 then shows that τ^{a′}_n < ∞ a.s. for all a′ ∈ A_{ζ(a)}. Therefore, by induction, {v_t(a) : t ∈ T} satisfies (GA1)-(GA3) for all a ∈ A.
For (GA4), consider any a ∈ A, and let i = i(ζ(a)). Suppose that for some a′ and a″ in A_{ζ(a)} there exist constants u′ > u″ such that

    Σ_{n=1}^t I(a′ ∈ ξ_n) I(u^i_n = u′) / N_t(a′) → 1 a.s. as N_t(a′) → ∞

and

    Σ_{n=1}^t I(a″ ∈ ξ_n) I(u^i_n = u″) / N_t(a″) → 1 a.s. as N_t(a″) → ∞.

By (A4), f_t(a′) → u′ and f_t(a″) → u″ a.s. as t → ∞. Let ε = (u′ − u″)/3. Then, a.s.,

    P(v_{τ^a_n}(a′) > v_{τ^a_n}(a″) | F_{τ^a_{n−1}})
      ≥ P(|g_{τ^a_n}(a′)| < ε and |g_{τ^a_n}(a″)| < ε | F_{τ^a_{n−1}}) × P(f_{τ^a_n}(a′) ∈ [u′ − ε, u′ + ε] and f_{τ^a_n}(a″) ∈ [u″ − ε, u″ + ε] | F_{τ^a_{n−1}})
      → 1 a.s. as n → ∞. □
B Stability Lemmas

This section presents the lemmas that are used in the proof of Theorem 6.¹²

¹²The stability theorem is a well-known fact that often appears in standard probability textbooks as an exercise. However, since a proof for the version needed here could not be located, it is presented as Lemma B.2 and a proof is provided.

Lemma B.1. Let Z_n → 1 a.s. as n → ∞, with 0 ≤ Z_n ≤ 1 a.s. for all n. Then (1/n) Σ_{k=1}^n Z_k → 1 a.s. as n → ∞.

Proof. Consider any ω ∈ Ω such that Z_n(ω) → 1 as n → ∞. Fix any ε > 0. Then,
∃N_ε > 0 such that ∀n > N_ε, Z_n(ω) ≥ 1 − ε. Then, ∀n > N_ε,

    (1/n) Σ_{k=1}^n Z_k(ω) = (1/n) Σ_{k=1}^{N_ε} Z_k(ω) + (1/n) Σ_{k=N_ε+1}^n Z_k(ω) ≥ (1/n) Σ_{k=1}^{N_ε} Z_k(ω) + (n − N_ε)(1 − ε)/n.

So, ∀ε > 0,

    lim inf_n (1/n) Σ_{k=1}^n Z_k(ω) ≥ lim inf_n ( (1/n) Σ_{k=1}^{N_ε} Z_k(ω) + (n − N_ε)(1 − ε)/n ) = 1 − ε.

Therefore, (1/n) Σ_{k=1}^n Z_k(ω) → 1, and hence (1/n) Σ_{k=1}^n Z_k → 1 a.s. □
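Lemma B.1 is a Cesàro-mean statement, and it is easy to check numerically. The sketch below uses the deterministic sequence Z_n = 1 − 1/n, an arbitrary choice that satisfies the hypotheses (it lies in [0, 1] and tends to 1):

```python
def cesaro_mean(n):
    """(1/n) * sum of Z_1..Z_n for Z_k = 1 - 1/k."""
    return sum(1 - 1 / k for k in range(1, n + 1)) / n

print(cesaro_mean(10), cesaro_mean(10_000))
```

Here cesaro_mean(n) = 1 − H_n/n, where H_n is the n-th harmonic number, and H_n/n → 0, so the average tends to 1 as the lemma asserts.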
Lemma B.2. Let {Z_n : n ∈ T} be adapted to the filtration {F_n : n ∈ T}, with F_0 = {∅, Ω}. Suppose there exists a constant M such that |Z_n| < M a.s. for all n. Then (1/n) Σ_{k=1}^n (Z_k − E[Z_k | F_{k−1}]) → 0 a.s. as n → ∞.

Proof. Let Y_n = Σ_{k=1}^n (Z_k − E[Z_k | F_{k−1}])/k. Then E[Y_n²] < ∞. Also, a.s.,

    E[Y_{n+1} | F_n]
      = Σ_{k=1}^{n+1} (E[Z_k | F_n] − E[E[Z_k | F_{k−1}] | F_n]) / k
      = (E[Z_{n+1} | F_n] − E[Z_{n+1} | F_n]) / (n + 1) + Σ_{k=1}^n (E[Z_k | F_n] − E[E[Z_k | F_{k−1}] | F_n]) / k
      = Σ_{k=1}^n (Z_k − E[Z_k | F_{k−1}]) / k
      = Y_n,

since ∀k ≤ n, Z_k is F_n-measurable and F_{k−1} ⊂ F_n. So {Y_n : n ∈ T} is a martingale with bounded second moments. By the martingale convergence theorem, there exists Y such that

    Y_n = Σ_{k=1}^n (Z_k − E[Z_k | F_{k−1}]) / k → Y a.s. as n → ∞.

Then, by Kronecker's lemma, (1/n) Σ_{k=1}^n (Z_k − E[Z_k | F_{k−1}]) → 0 a.s. as n → ∞. □
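Lemma B.2 can likewise be checked numerically. In the sketch below (an arbitrary example, not from the paper), Z_k is ±1 and depends on the past through the sign of a running sum, so the sequence is adapted but not i.i.d.; nevertheless E[Z_k | F_{k−1}] = 0 because the underlying coin is fair, and the centered average should vanish:

```python
import random

def centered_average(n, seed=0):
    """(1/n) * sum of (Z_k - E[Z_k | F_{k-1}]) for a bounded adapted sequence."""
    rng = random.Random(seed)
    s, total = 0, 0.0
    for _ in range(n):
        coin = rng.choice([-1, 1])        # fair coin, independent of the past
        z = coin * (1 if s >= 0 else -1)  # Z_k depends on the past through s
        total += z                        # E[Z_k | F_{k-1}] = 0 here
        s += coin
    return total / n

print(abs(centered_average(200_000)))
```

The running average is on the order of 1/√n, consistent with the lemma's almost-sure convergence to zero.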
References
[1] Blackburn, J. (1936). Acquisition of Skill: An Analysis of Learning Curves.
IHRB Report, No. 73.
[2] Ellison, G. (1993). Learning, Local Interaction, and Coordination. Economet-
rica, 61: 1047-1072.
[3] Erev, I. and A. Roth. (1998). Predicting How People Play Games: Reinforce-
ment Learning in Experimental Games with Unique, Mixed Strategy Equilibria.
American Economic Review, 88: 848-881.
[4] Jehiel, P. and D. Samet. (2000). Learning to Play Games in Extensive Form by
Valuation. Working Paper.
[5] Kaelbling, L., M. Littman, and A. Moore. (1996). Reinforcement Learning: A
Survey. Journal of Artificial Intelligence Research, 4: 237-285.
[6] Kandori, M., G. Mailath, and R. Rob. (1993). Learning, Mutation, and Long
Run Equilibria in Games. Econometrica, 61: 29-56.
[7] Laslier, J-F. and B. Walliser. (2005). A Reinforcement Learning Process in Extensive Form Games. International Journal of Game Theory, 33: 219-227.
[8] Sutton, R. (ed.). (1992). Reinforcement Learning. Boston: Kluwer Academic
Publishers.
[9] Sutton, R. and A. Barto. (1998). Reinforcement Learning: An Introduction.
Cambridge: MIT Press.
[10] Thorndike, E. (1911). Animal Intelligence. New York: Macmillan.