Reinforcement Learning in Perfect-Information Games∗
Maxwell Pak†
Queen’s University
July 2006
Key words: reinforcement learning, extensive-form games
JEL Classification: D83
MSC2000 subject classification: primary: 91A18, 91A22; secondary: 68Q32
OR/MS subject classification: primary: games/group decisions
∗This paper could not have been written without the help and encouragement from my advisor Chris Shannon. Her suggestions have been indispensable for this work from its inception, and I owe her my deepest gratitude. I have also benefited greatly from many discussions with Botond Koszegi. I would like to thank Robert Anderson, Jim Bergin, Steve Goldman, Matthew Rabin, and Robert Powell for their helpful comments as well. Support from the NSF grants SES-9710424 and SES-9818759 is gratefully acknowledged.
†Department of Economics, Queen’s University, Kingston, Ontario K7L 3N6, Canada. Phone: 613.533.2251. Fax: 613.533.6668. E-mail: [email protected].
Abstract
This paper studies action-based reinforcement learning in finite perfect-information games. Restrictions on the valuation updating rule that guarantee that the play eventually converges to a subgame-perfect Nash equilibrium (SPNE) are identified. These conditions are mild enough to contain interesting and plausible learning behavior. We provide two examples of such updating rules that suggest that the extent of the knowledge and rationality assumptions needed to support a SPNE outcome in finite perfect-information games may be minimal.
1 Introduction
Most learning models in game theory feature strategy-based learning in which play-
ers attempt to learn their optimal strategies.1 While this may be reasonable in
simultaneous-move games, strategy-based learning models fail to provide a good de-
scriptive model of learning in extensive-form games because they require the players
to form complete contingent plans of actions at all of their decision nodes before
the game is played. For example, even in a relatively simple game like tic-tac-toe
in which each player makes at most four action choices, the game tree, ignoring
symmetry, contains roughly 9! nodes, and forming a complete strategy for the game
is beyond the ability of human players.
Since much of the real learning that takes place in extensive-form games appears to involve players who do not use complete strategies, this paper considers an action-based reinforcement learning model in which players make an action choice at a decision node only when that node is reached during the game play.2 In particular, we study a model in a setting where players repeatedly play a finite perfect-information game but treat each play myopically so that they are only concerned with the current period payoffs. When a decision node is reached during the game play, the player who moves at that node assigns valuations to her available actions and chooses the action with the highest valuation. The valuations are based on the payoff experience the player has had in the past periods in which that action had been played. We identify conditions on the valuation updating rule that lead to the players eventually playing a subgame-perfect Nash equilibrium (SPNE).
Our approach is similar to Jehiel and Samet [4] and Laslier and Walliser [7].
1For example, in fictitious play models players keep track of their opponents’ plays and choose a strategy that is a best response to the empirical frequency of their opponents’ plays. In replicator dynamic models, players adjust the frequency with which they play a pure strategy in proportion to the current payoff corresponding to that strategy.
2Good surveys of reinforcement learning in the machine learning literature are Sutton and Barto [9] and Kaelbling, Littman, and Moore [5]. The papers collected in Sutton [8] provide a more technical introduction to the reinforcement learning literature.
However, while these papers specify one particular valuation updating rule and show that the play converges to a SPNE in some appropriate sense, we provide a more general result: we identify restrictions on the valuation updating rules that guarantee convergence to a SPNE. As we show through examples, these restrictions are mild enough to encompass a wide range of learning behaviors.
Moreover, Jehiel and Samet employ an ε-greedy learning rule in which players take the action with the highest valuation with probability 1 − ε or explore by taking a random action with probability ε. However, whether the exploration is viewed as experimentation or a mistake, assuming that the exploration probability stays the same no matter how much experience a player has gained seems antithetical in a model of learning. A more plausible model of learning should reflect the fact that the rate at which a player experiments, or makes mistakes, when playing the same game for the millionth time is lower than when she has played it only for the first few times. In addition, a constant exploration probability is also undesirable in a prescriptive model of learning because it is difficult to know the appropriate value for the exploration probability. The desire to converge to near optimal behavior in the limit requires the probability of exploration to be set at a small value. However, exploring with small probability means that it will be difficult for the player to escape suboptimal behavior in the early plays of the game. Thus, good limiting behavior comes at the price of poor finite time performance.
To resolve the tension between the desired limit behavior and the desired finite
time behavior, a player should explore often in early periods and less frequently in
later periods. Here, we accomplish this by endogenizing the exploration probability.
In fact, in our model the players do not explicitly experiment. Rather, they always
take an action with the highest valuation but are nevertheless induced to explore
because of the imperfection in forming valuations.3 We show that if the valuation of
3We believe this approach to be more natural when the players, as in Jehiel and Samet [4] and Laslier and Walliser [7], are assumed to be myopic and treat each game as an end in itself. In such a setting, it is not clear why players would choose to experiment. Since the players are not concerned with future payoffs, there is no reason why players would be willing to sacrifice current payoff and take an action that they believe to be suboptimal.
an action becomes more accurate as, and only as, the number of times that action is
taken increases, then the player is induced to explore with ever decreasing frequency
but still infinitely often. This, in turn, leads the play to converge to a SPNE.
The remainder of the paper is organized as follows. Section 2 describes the reinforcement learning model considered here. Because endogenizing exploration comes with a technical burden, the main results in their full generality are discussed in Appendix A. Instead, Section 3 restricts attention to additively separable valuations that are the sum of two terms, an empirical term and an error term, and presents four conditions that together guarantee convergence to a SPNE. The two substantive conditions required for these valuations have intuitive interpretations. The first condition requires that the error term converge to zero as the player’s experience with that action grows. The second condition requires that if the fraction of time a player receives some payoff u when an action is chosen converges to one, then the empirical term converges to that payoff.
In Section 3.1, two examples of additively separable valuations, the sample averaging model and the more primitive, and more interesting, simple recollection model, are presented and shown to satisfy the four conditions. Section 3.2 presents the main results for additively separable valuations. The first result shows that if a valuation process satisfies the first substantive condition, then players explore every action infinitely often. The second result shows that if it satisfies both conditions, then the play converges to a SPNE outcome in probability and the fraction of time in which a SPNE outcome is played converges to one almost surely in finite perfect-information games with no relevant ties in the payoffs. The proofs of the results are deferred to Appendix A; instead, an intuition for the results is developed using the simple recollection model.
Finally, simulation results suggesting that the two examples considered here attain near-SPNE behavior relatively fast are given in Section 3.3. The simulation results also illustrate the tension inherent in the ε-greedy algorithm and show that the models considered in this paper can outperform the ε-greedy models in both finite time and the limit. The paper concludes in Section 4.
2 The Model
We consider a set of players who repeatedly play a finite perfect-information game G but treat each game myopically. Let G be the set of decision nodes of G, z0 the root node, and GT the set of terminal nodes. For each node z ∈ G\GT , i(z) denotes the player who moves at z and Az denotes the set of actions available at z. In abuse of notation, i(a) is used to denote the player to whom the action a belongs, so that i(a) = i(z), where a ∈ Az. Let Gz denote the subgame starting at z and let Gz denote the set of nodes in Gz. For each z ∈ GT , let ui(z) denote player i’s payoff from z. Let C̲i = min{ui(z) : z ∈ GT } and C̄i = max{ui(z) : z ∈ GT }. Let A be the set of all actions in G. For later convenience we add the “null action” a0, interpreted as the action immediately preceding the root node z0, to A. With this addition, we can define a bijection ζ : A → G that maps each action to the node that immediately succeeds it.
The players are assumed to play the game in the following sequential manner. Let T = {1, 2, 3, ...} be the time index. For each t ∈ T, the game in period t begins by player i(z0) choosing an action from Az0. Player i(z0) is assumed to have some valuation vt(a) for each action a ∈ Az0 and chooses an action a′ with the highest valuation. When there is a tie, the player chooses according to some arbitrary tie-breaking rule. Once action a′ has been chosen, the game proceeds to node z′ = ζ(a′) and player i(z′) moves next. Player i(z′) is also assumed to have some valuation vt(a) for each action a ∈ Az′ and chooses an action a′′ with the highest valuation. The game proceeds in this manner until a terminal node is reached and each player i receives her payoff, which is denoted uit. The outcome of the t-th play of the game is identified by the path, denoted ξt, that was taken during the play.4
4That is, the path ξt is the unique sequence of actions that starts from a0 and leads to the terminal node reached at the end of the t-th play of the game.
play evolves over time is governed by how the valuations are updated, and identifying the restrictions on the valuation updating rule, represented by the valuation process {vt(a) : t ∈ T} for each a ∈ A or collectively by {vt : t ∈ T}, that lead the game play to evolve towards a SPNE is the main goal of the paper.
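The sequential play just described can be sketched in code. The following is a minimal illustration, not from the paper: the two-stage game, its node names, and the `play_once` helper are our own assumptions, and the valuation function is left abstract (here assumed deterministic within a period).

```python
import random

# Hypothetical two-stage game: nonterminal node -> (mover, {action: successor}).
GAME = {"z0": (1, {"L1": "tL", "R1": "z1"}),
        "z1": (2, {"L2": "tA", "R2": "tB"})}
PAYOFFS = {"tL": (3, 0), "tA": (1, 2), "tB": (2, 1)}  # terminal node -> payoffs

def play_once(valuation, rng=None):
    """One period of play: at each node reached, the mover chooses an action
    with the highest current valuation (ties broken uniformly at random);
    returns the path xi_t and the payoff vector of the terminal node."""
    rng = rng or random.Random()
    node, path = "z0", []
    while node in GAME:                    # stop once a terminal node is hit
        _mover, actions = GAME[node]
        best = max(valuation(a) for a in actions)
        chosen = rng.choice([a for a in actions if valuation(a) == best])
        path.append(chosen)
        node = actions[chosen]
    return path, PAYOFFS[node]
```

For instance, a valuation that ranks R1 highest at the root and R2 highest at z1 produces the path ["R1", "R2"] with payoffs (2, 1). The updating rules studied below only change how `valuation` is formed between periods.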
3 Additively Separable Valuation Process
For valuations satisfying additive separability, the sufficient conditions that guarantee convergence to a SPNE are particularly intuitive and easily verified. Therefore, this section restricts attention to valuation processes such that for all a ∈ A and t ∈ T, vt(a) = ft(a) + gt(a), where ft(a), interpreted as the empirical term, has support inside [C̲i(a), C̄i(a)] and gt(a), interpreted as the error term, has full support.5 We show that if an additively separable valuation process satisfies the four assumptions listed below, then the play converges to a SPNE.

In the following, let F denote the σ-field generated by the underlying valuation updating rule, and let the sub-σ-field Ft denote the collection of events up to time t. For any event B, the indicator variable I(B) is the random variable taking the value 1 or 0 depending on whether event B has occurred or not. Let

Nt(a) = Σ_{k=1}^{t} I(a ∈ ξk)

denote the number of times action a has occurred up to time t.
Assumption (A1). For all a′, a′′ ∈ Az with a′ ≠ a′′, vt+1(a′) and vt+1(a′′) are conditionally independent given Ft. That is, for all Borel B′, B′′ ⊂ R,

P(vt+1(a′) ∈ B′ and vt+1(a′′) ∈ B′′ | Ft) = P(vt+1(a′) ∈ B′ | Ft) P(vt+1(a′′) ∈ B′′ | Ft)

almost surely.
Assumption (A2). For all a ∈ A, the conditional distribution of vt+1(a) changes only after action a has been sampled. That is, for any Borel B ⊂ R,

P(vt+2(a) ∈ B | Ft+1) = P(vt+1(a) ∈ B | Ft) on {a ∉ ξt+1}.

5The full support assumption is made for convenience. For further discussion, see general assumption (GA3) in Appendix A.
Assumption (A3). For all a ∈ A, the error term converges to zero as the number of times action a is taken increases. That is, for all ε > 0,

P(|gt+1(a)| > ε | Ft) → 0 on {Nt(a) → ∞}.

Assumption (A4). For all a ∈ A, whenever

(Σ_{n=1}^{t} I(a ∈ ξn) I(u_n^{i(a)} = u)) / Nt(a) → 1 as Nt(a) → ∞

for some constant u, then ft(a) → u as t → ∞.6
Assumptions (A1) and (A2) formalize how a valuation process for action-based learning should behave. Assumption (A1) requires that the valuations of different actions are independent when conditioned on the past history. Assumption (A2) requires that the conditional distribution of the valuation of an action changes only after that action has been taken. Assumption (A3) requires that the error term in the valuation of an action decreases to zero, in an appropriate probabilistic sense, as the player’s experience with that action grows. Lastly, to interpret assumption (A4), suppose that the fraction of time a player receives some payoff u when action a is taken converges to one. Assumption (A4) then requires that the empirical term corresponding to action a must also converge, again in an appropriate sense, to u.
3.1 Examples of Additively Separable Valuation Processes
For an immediate example of an additively separable valuation process satisfying assumptions (A1)-(A4), consider the following model of learning, which we call the
6Formally, ft(a) → u as t → ∞ on {(Σ_{n=1}^{t} I(a ∈ ξn) I(u_n^i = u)) / Nt(a) → 1 as Nt(a) → ∞}.
sample averaging model. When evaluating an action, players in this model try to
use the average of the past payoffs associated with the action. Players are assumed
to have imperfect ability to calculate the historical average, so the valuation assigned
to an action is a perturbed average of the payoffs received in the periods in which
that action had been taken; however, the error associated with evaluating an action
is assumed to decrease as the number of times that action has been taken by the
player increases.
For all a ∈ A, let {ε_t^a : t ∈ T} be independent copies of a random variable ε^a that has full support. The valuation process {vt : t ∈ T} for the model is given by the following:

vt(a) = (Σ_{n=1}^{t−1} I(a ∈ ξn) u_n^i) / Ñt−1(a) + ε_t^a / Ñt−1(a),

where i = i(a) and Ñt−1(a) = max{1, Nt−1(a)}. Proposition 1 shows formally that the sample averaging model satisfies assumptions (A1)-(A4).
Proposition 1. The sample averaging model satisfies assumptions (A1)-(A4).

Proof. It is easy to see that the sample averaging model satisfies (A1) and (A2). Fix any z′ ∈ GT . Letting

ft(a) = u_i(z′) I(Nt−1(a) = 0) / Ñt−1(a) + (Σ_{n=1}^{t−1} I(a ∈ ξn) u_n^i) / Ñt−1(a),

and

gt(a) = (ε_t^a − u_i(z′) I(Nt−1(a) = 0)) / Ñt−1(a),

where i = i(a), it is also easy to see that gt(a) has full support and that ft(a) ∈ [C̲i, C̄i] almost surely. For any ε > 0,

P(|gt+1(a)| > ε | Ft) = P(|ε_{t+1}^a − u_i(z′) I(Nt(a) = 0)| > ε Ñt(a) | Ft)
= P(|ε^a − u_i(z′) I(Nt(a) = 0)| > ε Ñt(a) | Ft)
→ 0 on {Nt(a) → ∞}.

So, (A3) is satisfied. Lastly, since

u_i(z′) I(Nt−1(a) = 0) / Ñt−1(a) → 0

on {Nt(a) → ∞}, (A4) is readily satisfied.
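As a concrete illustration, the sample averaging valuation can be coded directly from its formula. This sketch is ours, not the paper's: the class and method names are invented, and the Gaussian error term is one illustrative full-support choice for ε^a.

```python
import random

class SampleAveraging:
    """Sample averaging valuation: v_t(a) equals the sum of past payoffs
    received after taking a, divided by Ntilde(a) = max(1, N(a)), plus a
    full-support error term (Gaussian here) also divided by Ntilde(a)."""
    def __init__(self, sigma=10.0, rng=None):
        self.total = {}   # running sum of payoffs received after each action
        self.count = {}   # N(a): number of times each action has been taken
        self.sigma = sigma
        self.rng = rng or random.Random()

    def value(self, a):
        n_tilde = max(1, self.count.get(a, 0))
        # (sum of past payoffs + error) / Ntilde(a); the error shrinks with
        # experience exactly as required by assumption (A3)
        return (self.total.get(a, 0.0) + self.rng.gauss(0.0, self.sigma)) / n_tilde

    def update(self, a, payoff):
        self.total[a] = self.total.get(a, 0.0) + payoff
        self.count[a] = self.count.get(a, 0) + 1
```

Setting `sigma=0` recovers exact sample averaging: after observing payoffs 2 and 4 for an action, `value` returns their average, 3.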
For a more interesting example, consider the next model, which we call the simple recollection model. This model tries to capture the following primitive learning behavior. When evaluating an action, players try to remember what the payoffs had been in the previous periods in which that action had been chosen and assign one of the past payoffs as the value of that action. Naturally, the more a player receives a particular payoff after playing action a, the more likely it is that the value attached to action a is that particular payoff. Players are assumed to have imperfect memory, so that there is always a chance that players make an erroneous recall. However, the chance of making an error when evaluating an action decreases as the number of times that action has been taken by the player increases.
For all a ∈ A, let {η_t^a : t ∈ T} be independent copies of a uniform[0, 1] random variable, and let {ε_t^a : t ∈ T} be independent copies of an independent random variable ε^a that has full support. Let τ_k^a denote the period in which action a was chosen for the k-th time. The valuation process {vt : t ∈ T} is given formally by:

vt(a) = Σ_{k=1}^{Nt−1(a)} I(η_t^a ∈ (k/(1 + Nt−1(a)), (k + 1)/(1 + Nt−1(a))]) u_{τ_k^a}^i + I(η_t^a ≤ 1/(1 + Nt−1(a))) ε_t^a,

where i = i(a).
This process behaves as if there is an urn, or a memory bank, corresponding to
each action. Each urn initially contains one card, called the wild card. Suppose
node z is reached during the course of the t-th play. Player i(z) assigns a value vt(a)
to each a ∈ Az by drawing a card from the urn corresponding to action a. If a card
that is drawn is a wild card then the value assigned to the action is the outcome of
a draw from an independent random variable εat . If the card is not a wild card then
the value assigned to the action is the pre-recorded value on the card. In either case
the card is placed back into the urn once the value has been assigned to a.
If during the course of the t-th play action a was chosen, then at the end of the
t-th play player i(a)’s payoff in the t-th play is recorded on a new card and placed
into the urn for that action. Thus, each time action a is chosen, the number of
cards in the corresponding urn increases by one. Therefore, the chance of receiving
a random valuation declines as players gain experience but never disappears. On
the other hand, the distribution of vt(a) converges to the empirical distribution of
the payoffs received when action a was chosen as the number of the times in which
action a is chosen goes to infinity. For example, if player i receives the same payoff
u after choosing action a for all but finitely many times, then vt(a) converges to u
with probability one as t goes to infinity.
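The urn story translates almost line for line into code. This is our own sketch, not the paper's: the class name is invented and the Gaussian wild-card draw is one illustrative full-support choice for ε^a.

```python
import random

class SimpleRecollection:
    """Urn-based valuation: the urn for each action starts with one wild card;
    a draw returns a recorded past payoff, or a fresh full-support random
    value (Gaussian here) if the wild card comes up. Cards are drawn with
    replacement, and each play of the action adds one recorded-payoff card,
    so the wild card is drawn with probability 1/(1 + N(a))."""
    def __init__(self, sigma=10.0, rng=None):
        self.urn = {}                  # action -> list of recorded payoffs
        self.sigma = sigma
        self.rng = rng or random.Random()

    def value(self, a):
        cards = self.urn.get(a, [])
        k = self.rng.randrange(len(cards) + 1)      # the extra slot is the wild card
        if k == len(cards):
            return self.rng.gauss(0.0, self.sigma)  # erroneous (random) recall
        return cards[k]                             # recall a past payoff

    def update(self, a, payoff):
        self.urn.setdefault(a, []).append(payoff)   # record payoff on a new card
```

After an action has yielded payoff 5 a hundred times, a draw returns 5 with probability 100/101 and a wild-card value otherwise, matching the convergence of vt(a) to the empirical payoff distribution described above.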
Proposition 2. The simple recollection model satisfies assumptions (A1)-(A4).

Proof. It is easy to see that the simple recollection model satisfies (A1) and (A2). Fix any z′ ∈ GT . Letting

ft(a) = I(η_t^a ≤ 1/(1 + Nt−1(a))) u_i(z′) + Σ_{k=1}^{Nt−1(a)} I(η_t^a ∈ (k/(1 + Nt−1(a)), (k + 1)/(1 + Nt−1(a))]) u_{τ_k^a}^i,

and

gt(a) = I(η_t^a ≤ 1/(1 + Nt−1(a))) (ε_t^a − u_i(z′)),

where i = i(a), it is also easy to see that gt(a) has full support and that ft(a) ∈ [C̲i, C̄i] almost surely. For any ε > 0,

P(|gt+1(a)| > ε | Ft) = P(η_{t+1}^a ≤ 1/(1 + Nt(a)) and |ε_{t+1}^a − u_i(z′)| > ε | Ft)
= P(η^a ≤ 1/(1 + Nt(a)) | Ft) P(|ε^a − u_i(z′)| > ε | Ft)
→ 0 on {Nt(a) → ∞}.

So, (A3) is satisfied.

Lastly, on {(Σ_{n=1}^{t} I(a ∈ ξn) I(u_n^i = u)) / Nt(a) → 1 as Nt(a) → ∞},

I(η_t^a ≤ 1/(1 + Nt−1(a))) u_i(z′) → 0

and

Σ_{k=1}^{Nt−1(a)} I(η_t^a ∈ (k/(1 + Nt−1(a)), (k + 1)/(1 + Nt−1(a))]) u_{τ_k^a}^i → u

as t → ∞. So, (A4) is satisfied.
3.2 Main Results for Additively Separable Valuation Process
Theorem 1 shows that valuation processes satisfying assumptions (A1)-(A3) gen-
erate sufficient exploration by inducing players to play every action in the game
infinitely often with probability one. In particular, this means that SPNE paths oc-
cur infinitely often. Of course, this is not a strong result since the same theorem also
shows that every path occurs infinitely often as well. So, this theorem also serves
as a negative result showing that the probability of non-SPNE paths occurring only
finitely many times is zero.
Theorem 1. Let G be a finite perfect-information game. Suppose an additively separable valuation process {vt : t ∈ T} satisfies assumptions (A1)-(A3). Then for all a ∈ A, Nt(a) → ∞ almost surely as t → ∞.
Given the weak restriction placed on valuation processes by assumption (A3), it should not be surprising that further restrictions are needed to obtain positive convergence results. The additional restriction required is provided by assumption (A4).
Because non-SPNE paths occur infinitely often almost surely, it is clear that the
play cannot converge almost surely to a SPNE path. However, Theorem 2 shows
that the play can converge to a SPNE path in probability so that the occurrence of
non-SPNE paths becomes increasingly rare.
Since ties in the payoffs can present difficulties, the convergence results in Theorem 2 are limited to the following class of games. Let Γ be the collection of all finite perfect-information games such that for all z′, z′′ ∈ GT , ui(z′) = ui(z′′) if and only if uj(z′) = uj(z′′) for all j. So, Γ includes games with generic payoffs, which have no ties in the payoffs, and games where ties are irrelevant, like “win-lose-draw” games. Theorem 2 shows that if G ∈ Γ and the valuation process satisfies assumptions (A1)-(A4), then the probability of playing a SPNE path converges to one as the number of plays goes to infinity. In addition, the fraction of time in which a SPNE path is played converges to one almost surely.7

Theorem 2. Let G ∈ Γ. Suppose an additively separable valuation process {vt : t ∈ T} satisfies assumptions (A1)-(A4). Then the probability of playing a SPNE path converges to one, and the fraction of time a SPNE path of G is played converges to one almost surely as t → ∞.
Theorem 8 in the appendix shows that assumptions (A1)-(A4) are special cases of the general assumptions (GA1)-(GA4). Therefore, Theorem 1 and Theorem 2 follow from the corresponding theorems for general valuation processes, Theorems 3 and 6 in Appendix A. Rather than discuss the general theorems and proofs here, we defer their discussion to the appendix and instead develop intuition for the results using the simple recollection model.
Consider the 2-player game given in Figure 1. Since there are only two actions at the root node, one of the two actions, say action L1, must be played infinitely often. Since the valuation vt(L1) takes the random wild card value with probability 1/(1 + Nt−1(L1)), the probability of vt(L1) taking a wild card value goes to zero as the number of plays goes to infinity. So, the probability that vt(L1) = 3 converges to one.
7Moreover, as shown in Theorem 6, of which Theorem 2 is a special case, this is true for all subgames of G as well. That is, the probability of playing a SPNE path of Gz when node z is reached converges to one. Also, the ratio of the number of times in which a SPNE path of Gz is played to the number of times in which node z is reached converges to one almost surely. The players, therefore, learn the optimal action at every node and not just at nodes on a SPNE path.
[Game tree: player 1 moves at the root z0, choosing L1, which ends the game with payoffs (3, 7), or R1, which leads to node z1. Player 2 moves at z1, choosing L2 (leading to z2) or R2 (leading to z3). Player 1 moves at z2, where L3 yields (4, 9) and R3 yields (2, 6), and at z3, where L4 yields (1, 8) and R4 yields (0, 5).]

Figure 1: Example
Now, suppose action R1 occurs only finitely often, say M − 1 many times, so that time T is the last time action R1 occurs. Then, for all t > T , the probability that vt(R1) takes a random wild card value is at least 1/M. So, for all large t > T , the probability that vt(R1) is greater than vt(L1), which is at least P(vt(L1) = 3) × P(wild card is drawn at R1) × P(wild card at R1 > 3), is approximately P(ε^{R1} > 3)/M. Since this is true for every large t > T , it is not possible that the event {vt(R1) > vt(L1)} never occurs after time T , contrary to our assumption. Therefore, action R1 must occur infinitely often. Similar intuition shows that at each node, every action must occur infinitely often. Of course, this argument is not entirely correct since, among other things, T is random and the events considered here are not independent events. The proof of Theorem 3 provides a formal argument.
Next, the convergence of the play to the SPNE path can be demonstrated by an induction argument. Consider node z2. Since both actions L3 and R3 are played infinitely often, the probability that vt(L3) = 4 and vt(R3) = 2 converges to 1 as t → ∞. So, the probability that player 1 plays action L3 converges to one. Likewise, the probability that player 1 plays action L4 converges to one as well. So, the probability that player 2 receives payoff 9 after choosing L2 converges to one, which in turn makes the probability that vt(L2) = 9 converge to one. Likewise, the probability that vt(R2) = 8 converges to one as well. Therefore, the probability that player 2 chooses action L2 converges to one. Through induction, it is not hard to see that the probability that player 1 will choose R1 converges to one. Thus, the play converges to the SPNE of the game. Since not all the subgames are played at every period, complications arise in formalizing this induction argument. The proof of Theorem 6 solves this problem with the use of stopping times.
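The induction argument can also be watched numerically. The following rough, self-contained simulation runs the simple recollection model on the Figure 1 game (the tree encoding is our own; the N(0, 100) wild card mirrors the simulations of Section 3.3) and counts how often the SPNE path (R1, L2, L3) is played.

```python
import random

rng = random.Random(0)

# Figure 1: nonterminal node -> (mover, {action: successor}); terminal -> payoffs.
TREE = {"z0": (1, {"L1": "tL1", "R1": "z1"}),
        "z1": (2, {"L2": "z2", "R2": "z3"}),
        "z2": (1, {"L3": "tL3", "R3": "tR3"}),
        "z3": (1, {"L4": "tL4", "R4": "tR4"})}
PAY = {"tL1": (3, 7), "tL3": (4, 9), "tR3": (2, 6), "tL4": (1, 8), "tR4": (0, 5)}

urn = {a: [] for node in TREE for a in TREE[node][1]}  # action -> recorded payoffs

def value(a):
    """Simple recollection draw: wild card with probability 1/(1 + N(a))."""
    cards = urn[a]
    k = rng.randrange(len(cards) + 1)
    return rng.gauss(0.0, 10.0) if k == len(cards) else cards[k]

T, spne_count = 5000, 0
for t in range(T):
    node, taken = "z0", []
    while node in TREE:
        mover, actions = TREE[node]
        vals = {a: value(a) for a in actions}   # valuations drawn only when reached
        a = max(vals, key=vals.get)
        taken.append((a, mover))
        node = actions[a]
    payoffs = PAY[node]
    for a, mover in taken:                      # record each mover's own payoff
        urn[a].append(payoffs[mover - 1])
    if [a for a, _ in taken] == ["R1", "L2", "L3"]:
        spne_count += 1

print(f"SPNE path frequency over {T} plays: {spne_count / T:.2f}")
```

Because the wild-card probability at each action declines as its urn fills, the SPNE path frequency rises over time, while every action is still sampled infinitely often in the limit, as Theorem 1 requires.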
3.3 Finite Time Properties: Simulation Results
The theoretical results obtained here are results of asymptotic convergence. However, it is often argued that asymptotic results, while reassuring, have little practical relevance.8 In light of such views, this section concludes with simulation results that roughly assess the finite time properties of the two valuation processes. Specifically, 1000 copies of the simple 3-player game given in Figure 2 were generated, each with payoffs drawn randomly from a normal distribution with mean zero and variance 100. The learning behavior of the simple recollection model and the sample averaging model on each of these games, both with ε_t^a’s drawn from a normal distribution with mean zero and variance 100, was simulated and compared to the simulated learning behavior generated by the sample averaging ε-greedy models and the sample averaging greedy (ε = 0) model.
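For reference, the ε-greedy baselines differ from the models above only in the action-selection step: with probability ε the player ignores the valuations and picks a uniformly random action. A minimal sketch (the function name is ours):

```python
import random

def epsilon_greedy_choice(actions, valuation, eps, rng=None):
    """With probability eps explore uniformly at random; otherwise pick an
    action with the highest valuation (ties broken at random). eps = 0
    gives the greedy rule used as the GR baseline."""
    rng = rng or random.Random()
    actions = list(actions)
    if rng.random() < eps:
        return rng.choice(actions)              # constant-rate exploration
    best = max(valuation(a) for a in actions)
    return rng.choice([a for a in actions if valuation(a) == best])
```

Note that here the exploration rate is fixed for all t, which is exactly the feature the endogenous-exploration models of this paper avoid.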
For each simulation, the fraction of time the SPNE outcome had occurred up to the t-th play was tracked for each t ≤ 2000 in each of the learning models. The graph in Figure 3 presents the average of these ratios over the 1000 simulations for each learning model. The results suggest that the simple recollection model and the sample averaging model not only reach reasonable optimality relatively quickly but also begin to outperform the ε-greedy models quickly.
Since the simple recollection model starts with greater randomness than the ε-greedy
8See, for example, the remarks on the use of eventual convergence to optimal behavior in measuring learning performance in Kaelbling, Littman, and Moore [5]. As another example, Ellison [2] questions the relevance of the asymptotic convergence results of Kandori, Mailath, and Rob [6] since the time required before the system reaches its limit is beyond the time scale relevant for humans.
[Three-stage game tree: player 1 moves at the root, player 2 moves at each of the two second-stage nodes, and player 3 moves at each of the four third-stage nodes. Each of the eight terminal nodes k = 1, ..., 8 carries a payoff vector (u_1^k, u_2^k, u_3^k).]

Figure 2: Test Problem
[Average fraction of plays ending in a SPNE path (vertical axis, 0.1 to 0.9) against the number of plays (horizontal axis, 0 to 2000) for each model. SA: Sample Averaging; SR: Simple Recollection; EG 0.1: ε-Greedy with ε = 0.1; EG 0.01: ε-Greedy with ε = 0.01; GR: Greedy.]

Figure 3: Simulation Results
models, it initially performs worse than the ε-greedy models. However, whereas the ε-greedy models are forced to explore with the same probability no matter how much experience the players have gained, the simple recollection model allows the exploration probabilities to decline with experience, so it eventually outperforms the ε-greedy models. In these simulations, this occurs before the 400th play.
In addition, these simulations also illustrate the inherent tension between finite time and limit behavior present in ε-greedy models. As discussed before, while smaller values of ε lead to better limit behavior, they come at the price of poorer finite time performance. Indeed, the simulations show that while the ε-greedy model with ε = 0.01 will eventually have better limit behavior than the model with ε = 0.1, it performs worse for the first 2000 periods considered. The sample averaging model with endogenous exploration probability introduced in this paper resolves this tension and performs better in both finite time and the limit.
Lastly, to see the importance of maintaining exploration, consider the learning
curve corresponding to the greedy method. Because the greedy method starts out
with no randomness, the greedy method initially performs as well as or better than
the other methods. However, because it never explores, it is trapped in taking
suboptimal actions in the later plays.
4 Concluding Remarks
This paper studies sufficient conditions on the valuation updating rule that guarantee that the play converges to a subgame-perfect Nash equilibrium in finite perfect-information games with no relevant ties in the payoffs. The restrictions we identify are mild enough to contain interesting and plausible learning behavior. The two examples of such valuation updating rules we provide correspond to learning behaviors that are primitive but still satisfy the two basic principles of learning outlined in Erev and Roth [3]: the Law of Effect (Thorndike [10]) and the Power Law of Practice (Blackburn [1]), which assert, respectively, that the more often an action leads to a good outcome the more likely it is that this action will be chosen in the future, and that learning curves rise steeply initially but flatten out later. The very naivete of the learning behaviors associated with these models suggests that in this class of games, the extent of rationality that is needed to support a subgame-perfect Nash equilibrium may be minimal.
A General Valuation Process
This section presents the restrictions on general valuation updating rules that guarantee convergence to a SPNE. In particular, the additive separability assumption is dropped. Let (Ω, F, P) be the probability space on which a valuation process {v_t : t ∈ T} is defined. Let F_0 = {∅, Ω}, and let F_t = σ(v_1, ..., v_t) be the sub-σ-field consisting of events up to time t. For any action a ∈ A, let τ^a_n denote the time of the n-th occurrence of action a. That is, τ^a_0 = 0 and, ∀n ∈ T, τ^a_n = inf{t > τ^a_{n−1} : a ∈ ξ_t}. Then {τ^a_n : n ∈ Z_+} is a sequence of stopping times.

The following facts about stopping times are used throughout this section. For any stopping time τ, F_τ = {B ∈ F : ∀n, B ∩ {τ ≤ n} ∈ F_n} is a σ-field. Suppose τ_0 < τ_1 < τ_2 < ... almost surely; then {F_{τ_n} : n ∈ Z_+} is a filtration. Moreover, if {Y_t : t ∈ Z_+} is adapted to {F_t : t ∈ Z_+}, then {Y_{τ_n}} is adapted to {F_{τ_n} : n ∈ Z_+}, and if Y_t → Y almost surely as t → ∞, then Y_{τ_n} → Y almost surely as n → ∞.
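For concreteness, the occurrence times τ^a_n and the counts N_t(a) are straightforward to compute from a record of play paths. A minimal sketch (the action labels are hypothetical):

```python
def occurrence_times(paths, a):
    """tau^a_n for n = 1, 2, ...: the periods (1-indexed) at which action a occurs."""
    return [t for t, xi in enumerate(paths, start=1) if a in xi]

def visit_count(paths, a, t):
    """N_t(a): the number of periods up to and including t in which a was taken."""
    return sum(1 for xi in paths[:t] if a in xi)

paths = [{"L"}, {"R"}, {"L"}, {"L"}]   # a hypothetical record of play paths
print(occurrence_times(paths, "L"))    # -> [1, 3, 4]
print(visit_count(paths, "L", 3))      # -> 2
```

Note that τ^a_n is a stopping time precisely because deciding whether τ^a_n ≤ t requires only the play record up to time t, which is what the slice `paths[:t]` reflects.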
Each of the four conditions (GA1)-(GA4) below plays the same role as the corresponding assumption (A1)-(A4) for additively separable valuations. Assumptions (GA1) and (GA2) formalize how a valuation process for reinforcement learning should behave: (GA1) requires that valuations of actions are independent when conditioned on the past history, and (GA2) requires that the conditional distribution of the valuation of an action changes only after that action has been taken. The substantive conditions are assumptions (GA3) and (GA4): (GA3) guarantees that every action is taken infinitely often, and (GA4) provides the additional condition that guarantees that the play path converges to a SPNE path.
Assumption (GA1). For all a ∈ A, a′, a″ ∈ A_{ζ(a)} with a′ ≠ a″, and Borel sets B′, B″ ⊂ R,

    P(v_{τ^a_{n+1}}(a′) ∈ B′ and v_{τ^a_{n+1}}(a″) ∈ B″ | F_{τ^a_n}) = P(v_{τ^a_{n+1}}(a′) ∈ B′ | F_{τ^a_n}) P(v_{τ^a_{n+1}}(a″) ∈ B″ | F_{τ^a_n})

almost surely.
Assumption (GA2). For all a ∈ A, a′ ∈ A_{ζ(a)}, and Borel B ⊂ R,

    P(v_{τ^a_{n+2}}(a′) ∈ B | F_{τ^a_{n+1}}) = P(v_{τ^a_{n+1}}(a′) ∈ B | F_{τ^a_n}) on {a′ ∉ ξ_{τ^a_{n+1}}}.
In the following, let C = 1 + max{|u_i(z)| : i ∈ I, z ∈ G_T}, where I is the set of players of G.⁹
Assumption (GA3). For all a ∈ A and a′ ∈ A_{ζ(a)},

    {P(v_{τ^a_{n+1}}(a′) ≥ C | F_{τ^a_n}) → 0 as n → ∞} = {N_{τ^a_n}(a′) → ∞ as n → ∞}.
To interpret assumption (GA3), first note that {v_{τ^a_n}(a′) ≥ C} is the event that the valuation of action a′ is "overly optimistic," and {N_{τ^a_n}(a′) → ∞ as n → ∞} is the event that the number of times action a′ is taken goes to infinity. Therefore, assumption (GA3) essentially requires that the probability of a player having an overly optimistic valuation of an action declines to zero if and only if the number of times a′ is sampled goes to infinity.¹⁰
Assumption (GA4). For all a ∈ A, i = i(ζ(a)), and a′, a″ ∈ A_{ζ(a)}, if there exist constants u′ > u″ such that

    Σ_{n=1}^t I(a′ ∈ ξ_n) I(u^i_n = u′) / N_t(a′) → 1 a.s. as N_t(a′) → ∞

and

    Σ_{n=1}^t I(a″ ∈ ξ_n) I(u^i_n = u″) / N_t(a″) → 1 a.s. as N_t(a″) → ∞,

then

    P(v_{τ^a_{n+1}}(a′) > v_{τ^a_{n+1}}(a″) | F_{τ^a_n}) → 1 a.s. as n → ∞.
⁹It is not hard to see from the proof of Theorem 3 that C need not be this large. It is enough for C to be player i(ζ(a))'s highest possible payoff in the subgame G_{ζ(a)}. The particular choice of C made here simplifies notation and the proofs.
¹⁰A simpler way to state this requirement would have been the expression "P(v_{τ^a_n}(a′) ≥ C) → 0 if N_{τ^a_n}(a′) → ∞." However, because the events {v_{τ^a_n}(a′) ≥ C}, n ∈ T, are not independent, a conditional version of this idea, as stated in assumption (GA3), is needed.
To interpret assumption (GA4), suppose that the fraction of time a player receives some payoff u′ when an action a′ is taken and the fraction of time the player receives some payoff u″ when an action a″ is taken both converge to one, where u′ > u″. Assumption (GA4) then requires that the conditional probability of the player's valuation of action a′ being higher than the valuation of action a″ goes to one.
Theorem 3 shows that valuation processes satisfying assumptions (GA1)-(GA3) induce players to play every action in the game infinitely often with probability one.

Theorem 3. Let G be a finite perfect-information game. Suppose a valuation process {v_t : t ∈ T} satisfies assumptions (GA1)-(GA3). Then, ∀a ∈ A, N_t(a) → ∞ almost surely as t → ∞.
Proof. Take any a ∈ A, and let z = ζ(a). The proof proceeds by showing that if a ∈ ξ_t infinitely often a.s., then ∀a′ ∈ A_z, N_t(a′) → ∞ a.s. So, assume a ∈ ξ_t infinitely often a.s., so that τ^a_n < ∞ a.s. for all n. Since |A_z| < ∞, there must be at least one action in A_z that occurs infinitely often a.s. Suppose, towards a contradiction, that there exists a nonempty Â_z ⊂ A_z such that P(B) > 0, where B is the event {∀a′ ∈ Â_z, a′ ∈ ξ_t finitely often and ∀a′ ∈ A_z∖Â_z, a′ ∈ ξ_t infinitely often}.

Fix â ∈ Â_z, and let the constant C be as in assumption (GA3). Then there exists Y > 0 a.s. such that P(v_{τ^a_{n+1}}(â) ≥ C | F_{τ^a_n}) ≥ Y for all n on B. Moreover, P(v_{τ^a_{n+1}}(a′) ≥ C | F_{τ^a_n}) → 0 for all a′ ∈ A_z∖Â_z on B. Therefore, on B,

    Σ_{n=1}^∞ P(v_{τ^a_{n+1}}(â) > max_{a′ ∈ A_z∖Â_z} v_{τ^a_{n+1}}(a′) | F_{τ^a_n})
      ≥ Σ_{n=1}^∞ P(v_{τ^a_{n+1}}(â) ≥ C and ∀a′ ∈ A_z∖Â_z, v_{τ^a_{n+1}}(a′) < C | F_{τ^a_n})
      = Σ_{n=1}^∞ P(v_{τ^a_{n+1}}(â) ≥ C | F_{τ^a_n}) Π_{a′ ∈ A_z∖Â_z} P(v_{τ^a_{n+1}}(a′) < C | F_{τ^a_n})   (by conditional independence)
      ≥ Σ_{n=1}^∞ Y Π_{a′ ∈ A_z∖Â_z} (1 − P(v_{τ^a_{n+1}}(a′) ≥ C | F_{τ^a_n}))
      = ∞.

Hence, v_{τ^a_{n+1}}(â) > max_{a′ ∈ A_z∖Â_z} v_{τ^a_{n+1}}(a′) infinitely often by the conditional Borel-Cantelli lemma. However, max_{a′ ∈ A_z∖Â_z} v_{τ^a_{n+1}}(a′) > max_{a′ ∈ Â_z} v_{τ^a_{n+1}}(a′) for all but finitely many times on B. Since â ∈ Â_z, this is a contradiction. Thus, ∀a′ ∈ A_z, N_t(a′) → ∞ a.s. as t → ∞.

Finally, since a_0 is the null action, a_0 ∈ ξ_t infinitely often a.s. So the theorem follows by induction. □
The following two corollaries follow trivially from Theorem 3, so the proofs are omitted.

Corollary 4. If the assumptions of Theorem 3 are satisfied, every SPNE path occurs infinitely often with probability one.

Corollary 5. If the assumptions of Theorem 3 are satisfied, the probability that SPNE paths occur in all but finitely many periods is zero.
Recall that Γ is defined as the collection of all finite perfect-information games such that, ∀z′, z″ ∈ G_T, u_i(z′) = u_i(z″) if and only if u_j(z′) = u_j(z″) for all j. Theorem 6 shows that if G ∈ Γ and the valuation process satisfies assumptions (GA1)-(GA4), then the probability of playing a SPNE path converges to one as the number of plays goes to infinity. In addition, the fraction of time in which a SPNE path is played converges to one almost surely as the number of plays goes to infinity. Moreover, Theorem 6 shows that this is true for all subgames of G as well. That is, the probability of playing a SPNE path of G_z when node z is reached converges to one as the number of times z is reached goes to infinity. Also, the ratio of the number of times in which a SPNE path of G_z is played to the number of times in which node z is reached converges to one almost surely as the number of plays goes to infinity. The players, therefore, eventually learn the optimal action at every node and not just at nodes on a SPNE path.
In the following, if a ∈ ξ, let ξ^a = ξ ∩ {a′ ∈ A_{z′} : z′ ∈ G_{ζ(a)}} denote the continuation path of ξ. Let Ξ^a = {path ξ of G : a ∈ ξ and ξ^a is a SPNE path of G_{ζ(a)}}, and let S_t(a) = Σ_{n=1}^t I(ξ_n ∈ Ξ^a).

Theorem 6. Suppose G ∈ Γ and the valuation process {v_t : t ∈ T} satisfies assumptions (GA1)-(GA4). Then, for all a ∈ A, P(ξ_{τ^a_n} ∈ Ξ^a) → 1 as n → ∞, and S_t(a)/N_t(a) → 1 a.s. as t → ∞.¹¹
Proof. The proof proceeds by induction on subgames of G. Let L(G_z) denote the maximum length of a path in G_z. For all z ∈ G∖G_T, let Ã_z = {ã ∈ A_z : there exists a SPNE of G_z in which ã occurs with positive probability}.

Let a ∈ A be such that L(G_{ζ(a)}) = 1. Let z = ζ(a) and i = i(z). By Theorem 3, ∀a′ ∈ A_z, N_t(a′) → ∞ a.s. Since L(G_z) = 1, ζ(a′) ∈ G_T for all a′ ∈ A_z, so I(a′ ∈ ξ_t) I(u^i_t = u^i(ζ(a′))) = I(a′ ∈ ξ_t) a.s. Hence,

    Σ_{n=1}^t I(a′ ∈ ξ_n) I(u^i_n = u^i(ζ(a′))) / N_t(a′) = Σ_{n=1}^t I(a′ ∈ ξ_n) / N_t(a′) = 1 a.s. for all t ∈ T.

Let ã ∈ Ã_z. Then, a.s.,

    1 ≥ E[I(ξ_{τ^a_{n+1}} ∈ Ξ^a) | F_{τ^a_n}]
      = P(ξ_{τ^a_{n+1}} ∈ Ξ^a | F_{τ^a_n})
      ≥ P(v_{τ^a_{n+1}}(ã) > v_{τ^a_{n+1}}(a′) for all a′ ∈ A_z∖Ã_z | F_{τ^a_n})
      = Π_{a′ ∈ A_z∖Ã_z} P(v_{τ^a_{n+1}}(ã) > v_{τ^a_{n+1}}(a′) | F_{τ^a_n})   (by conditional independence)
      → 1 as n → ∞ by assumption (GA4).

Therefore, E[I(ξ_{τ^a_{n+1}} ∈ Ξ^a) | F_{τ^a_n}] → 1 a.s. as n → ∞, and P(ξ_{τ^a_n} ∈ Ξ^a) → 1 as n → ∞ by the dominated convergence theorem. Next,

    Σ_{k=1}^n E[I(ξ_{τ^a_{k+1}} ∈ Ξ^a) | F_{τ^a_k}] / n → 1 a.s. as n → ∞ by Lemma B.1,

and

    Σ_{k=1}^n (I(ξ_{τ^a_{k+1}} ∈ Ξ^a) − E[I(ξ_{τ^a_{k+1}} ∈ Ξ^a) | F_{τ^a_k}]) / n → 0 a.s. as n → ∞ by Lemma B.2.

    ⟹ Σ_{k=1}^n I(ξ_{τ^a_k} ∈ Ξ^a) / n → 1 a.s. as n → ∞.

    ⟹ S_t(a)/N_t(a) = Σ_{n=1}^t I(ξ_n ∈ Ξ^a) / N_t(a) = Σ_{k=1}^{N_t(a)} I(ξ_{τ^a_k} ∈ Ξ^a) / N_t(a) → 1 a.s. as t → ∞.

¹¹The first version of this theorem showed that S_t(a)/N_t(a) → 1 in expectation. It was not until I saw the use of a stability theorem in Jehiel and Samet [4] that I realized that the proof can be easily strengthened to show almost sure convergence. I would like to thank them for their insight. The version of the stability theorem used is stated and proved as Lemma B.2.
Next, suppose that for all subgames G_{ζ(a)} of G such that L(G_{ζ(a)}) ≤ m, E[I(ξ_{τ^a_{n+1}} ∈ Ξ^a) | F_{τ^a_n}] → 1 a.s. as n → ∞ and S_t(a)/N_t(a) → 1 a.s. as t → ∞. Let a ∈ A be such that L(G_{ζ(a)}) = m + 1. Let z = ζ(a) and i = i(z). By Theorem 3, ∀a′ ∈ A_z, N_t(a′) → ∞ a.s. By the induction hypothesis, S_t(a″)/N_t(a″) → 1 a.s. for all a″ ∈ A_z. Then, for any ã ∈ Ã_z, a.s.,

    1 ≥ P(v_{τ^a_{n+1}}(ã) > v_{τ^a_{n+1}}(a′) for all a′ ∈ A_z∖Ã_z | F_{τ^a_n})
      = Π_{a′ ∈ A_z∖Ã_z} P(v_{τ^a_{n+1}}(ã) > v_{τ^a_{n+1}}(a′) | F_{τ^a_n})   (by conditional independence)
      → 1 as n → ∞ by assumption (GA4).

Therefore, a.s.,

    1 ≥ E[I(ξ_{τ^a_{n+1}} ∈ Ξ^a) | F_{τ^a_n}]
      = E[ Σ_{ã ∈ Ã_z} I(v_{τ^a_{n+1}}(ã) = max_{a′ ∈ A_z} v_{τ^a_{n+1}}(a′)) I(ξ_{τ^a_{n+1}} ∈ Ξ^ã) | F_{τ^a_n} ]
      = E[ Σ_{ã ∈ Ã_z} I(v_{τ^a_{n+1}}(ã) = max_{a′ ∈ A_z} v_{τ^a_{n+1}}(a′)) Σ_{ã ∈ Ã_z} I(ξ_{τ^a_{n+1}} ∈ Ξ^ã) | F_{τ^a_n} ]
      = P(max_{ã ∈ Ã_z} v_{τ^a_{n+1}}(ã) > max_{a′ ∈ A_z∖Ã_z} v_{τ^a_{n+1}}(a′) | F_{τ^a_n}) Σ_{ã ∈ Ã_z} P(ξ_{τ^a_{n+1}} ∈ Ξ^ã | F_{τ^a_n})
      → 1 as n → ∞ by the induction hypothesis.

So, E[I(ξ_{τ^a_{n+1}} ∈ Ξ^a) | F_{τ^a_n}] → 1 a.s. as n → ∞. Then P(ξ_{τ^a_n} ∈ Ξ^a) → 1 as n → ∞, and S_t(a)/N_t(a) → 1 a.s. as t → ∞, by the same argument as before. □
Corollary 7. Let G be a finite perfect-information game with generic payoffs, and let the valuation process {v_t : t ∈ T} satisfy assumptions (GA1)-(GA4). Then the probability of playing the SPNE path converges to one, and the fraction of time the SPNE path of G is played converges to one almost surely as t → ∞.

Proof. Since G has a unique SPNE path, the corollary follows trivially from Theorem 6. □
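As an illustration of the convergence described by Theorem 6, the sketch below simulates learning in a small hypothetical two-stage game: player 1 chooses L or R; after R, player 2 chooses l or r; payoffs are chosen so that the unique SPNE path is (R, r). The updating rule, sample averaging with exploration probability declining in experience, is only one concrete rule in the spirit of assumptions (GA1)-(GA4), not the paper's exact model.

```python
import random

def learn(plays=5000, seed=1):
    """Fraction of the last half of plays that follow the SPNE path (R, r)."""
    rng = random.Random(seed)
    # hypothetical payoffs: L -> (1, 0); (R, l) -> (0, 0); (R, r) -> (2, 1)
    u1 = {("L",): 1.0, ("R", "l"): 0.0, ("R", "r"): 2.0}
    u2 = {("L",): 0.0, ("R", "l"): 0.0, ("R", "r"): 1.0}
    q1, n1 = {"L": 0.0, "R": 0.0}, {"L": 0, "R": 0}   # player 1's node
    q2, n2 = {"l": 0.0, "r": 0.0}, {"l": 0, "r": 0}   # player 2's node

    def pick(q, n):
        eps = 1.0 / (1 + min(n.values()))   # exploration declines with experience
        if rng.random() < eps:
            return rng.choice(sorted(q))    # explore uniformly
        return max(q, key=q.get)            # exploit highest valuation

    spne = 0
    for t in range(plays):
        a1 = pick(q1, n1)
        path = (a1,) if a1 == "L" else ("R", pick(q2, n2))
        if len(path) == 2:                  # player 2 moved: update her valuation
            a2 = path[1]
            n2[a2] += 1
            q2[a2] += (u2[path] - q2[a2]) / n2[a2]
        n1[a1] += 1
        q1[a1] += (u1[path] - q1[a1]) / n1[a1]
        if path == ("R", "r") and t >= plays // 2:
            spne += 1
    return spne / (plays - plays // 2)

print(learn())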
We close with a theorem showing that additive separability and assumptions (A1)-(A4) are a special case of the general assumptions (GA1)-(GA4).

Theorem 8. Suppose an additively separable valuation process {v_t : t ∈ T} satisfies assumptions (A1)-(A3). Then the valuation process satisfies the general assumptions (GA1)-(GA3). If, in addition, the process satisfies assumption (A4), then it also satisfies general assumption (GA4).
Proof. Let a = a_0. Since a is the null action, τ^a_n < ∞ a.s. Then, by (A1) and the strong Markov property, for all a′, a″ ∈ A_{ζ(a)} with a′ ≠ a″, condition (GA1) is satisfied. Likewise, by (A2) and the strong Markov property, for all a′ ∈ A_{ζ(a)}, condition (GA2) is satisfied. Let i = i(ζ(a)). For all a′ ∈ A_{ζ(a)},

    P(v_{τ^a_{n+1}}(a′) ≥ C | F_{τ^a_n})
      ≤ P(f_{τ^a_{n+1}}(a′) ∈ [C̲_i, C̄_i] and g_{τ^a_{n+1}}(a′) ≥ C − C̄_i | F_{τ^a_n})
      = P(f_{τ^a_{n+1}}(a′) ∈ [C̲_i, C̄_i] | F_{τ^a_n}) P(g_{τ^a_{n+1}}(a′) ≥ C − C̄_i | F_{τ^a_n})
      = P(g_{τ^a_{n+1}}(a′) ≥ C − C̄_i | F_{τ^a_n})
      → 0 as n → ∞ on {N_{τ^a_n}(a′) → ∞ as n → ∞}

by (A3) and the strong Markov property. Conversely, on {P(v_{τ^a_{n+1}}(a′) ≥ C | F_{τ^a_n}) → 0}, N_{τ^a_n}(a′) → ∞, since the conditional distribution of v_{τ^a_{n+1}}(a′) differs from that of v_{τ^a_n}(a′) only if a′ has been taken in the interim. Therefore, condition (GA3) is satisfied.

The induction step in the proof of Theorem 3 then shows that τ^{a′}_n < ∞ a.s. for all a′ ∈ A_{ζ(a)}. Therefore, by induction, {v_t(a) : t ∈ T} satisfies (GA1)-(GA3) for all a ∈ A.
For (GA4), consider any a ∈ A, and let i = i(ζ(a)). Suppose that for some a′ and a″ in A_{ζ(a)} there exist constants u′ > u″ such that

    Σ_{n=1}^t I(a′ ∈ ξ_n) I(u^i_n = u′) / N_t(a′) → 1 a.s. as N_t(a′) → ∞

and

    Σ_{n=1}^t I(a″ ∈ ξ_n) I(u^i_n = u″) / N_t(a″) → 1 a.s. as N_t(a″) → ∞.

By (A4), f_t(a′) → u′ and f_t(a″) → u″ a.s. as t → ∞. Let ε = (u′ − u″)/3. Then, a.s.,

    P(v_{τ^a_n}(a′) > v_{τ^a_n}(a″) | F_{τ^a_{n−1}})
      ≥ P(|g_{τ^a_n}(a′)| < ε and |g_{τ^a_n}(a″)| < ε | F_{τ^a_{n−1}}) × P(f_{τ^a_n}(a′) ∈ [u′ − ε, u′ + ε] and f_{τ^a_n}(a″) ∈ [u″ − ε, u″ + ε] | F_{τ^a_{n−1}})
      → 1 a.s. as n → ∞. □
B Stability Lemmas

This section presents the lemmas that are used in the proof of Theorem 6.¹²

¹²The stability theorem is a well-known fact that often appears in standard probability textbooks as an exercise. However, since a proof for the version needed here could not be located, it is presented as Lemma B.2 and a proof is provided.

Lemma B.1. Let Z_n → 1 a.s. as n → ∞, with 0 ≤ Z_n ≤ 1 a.s. for all n. Then (1/n) Σ_{k=1}^n Z_k → 1 a.s. as n → ∞.

Proof. Consider any ω ∈ Ω such that Z_n(ω) → 1 as n → ∞. Fix any ε > 0. Then,
∃N_ε > 0 such that ∀n > N_ε, Z_n(ω) ≥ 1 − ε. Then, ∀n > N_ε,

    (1/n) Σ_{k=1}^n Z_k(ω) = (1/n) Σ_{k=1}^{N_ε} Z_k(ω) + (1/n) Σ_{k=N_ε+1}^n Z_k(ω) ≥ (1/n) Σ_{k=1}^{N_ε} Z_k(ω) + (n − N_ε)(1 − ε)/n.

So, ∀ε > 0,

    lim inf_n (1/n) Σ_{k=1}^n Z_k(ω) ≥ lim inf_n ( (1/n) Σ_{k=1}^{N_ε} Z_k(ω) + (n − N_ε)(1 − ε)/n ) = 1 − ε.

Therefore, (1/n) Σ_{k=1}^n Z_k(ω) → 1, and hence (1/n) Σ_{k=1}^n Z_k → 1 a.s. □
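Lemma B.1 is a Cesàro-mean statement, and it is easy to check numerically. The sketch below uses the deterministic sequence Z_n = 1 − 1/n, an arbitrary choice that satisfies the hypotheses (it lies in [0, 1] and tends to 1):

```python
def cesaro_mean(n):
    """(1/n) * sum of Z_1..Z_n for Z_k = 1 - 1/k."""
    return sum(1 - 1 / k for k in range(1, n + 1)) / n

print(cesaro_mean(10), cesaro_mean(10_000))
```

Here cesaro_mean(n) = 1 − H_n/n, where H_n is the n-th harmonic number, and H_n/n → 0, so the average tends to 1 as the lemma asserts.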
Lemma B.2. Let {Z_n : n ∈ T} be adapted to the filtration {F_n : n ∈ T}, with F_0 = {∅, Ω}. Suppose there exists a constant M such that |Z_n| < M a.s. for all n. Then (1/n) Σ_{k=1}^n (Z_k − E[Z_k | F_{k−1}]) → 0 a.s. as n → ∞.

Proof. Let Y_n = Σ_{k=1}^n (Z_k − E[Z_k | F_{k−1}])/k. Then E[Y_n²] < ∞. Also, a.s.,

    E[Y_{n+1} | F_n]
      = Σ_{k=1}^{n+1} (E[Z_k | F_n] − E[E[Z_k | F_{k−1}] | F_n]) / k
      = (E[Z_{n+1} | F_n] − E[Z_{n+1} | F_n]) / (n + 1) + Σ_{k=1}^n (E[Z_k | F_n] − E[E[Z_k | F_{k−1}] | F_n]) / k
      = Σ_{k=1}^n (Z_k − E[Z_k | F_{k−1}]) / k
      = Y_n,

since ∀k ≤ n, Z_k is F_n-measurable and F_{k−1} ⊂ F_n. So {Y_n : n ∈ T} is a martingale with bounded second moments. By the martingale convergence theorem, there exists Y such that

    Y_n = Σ_{k=1}^n (Z_k − E[Z_k | F_{k−1}]) / k → Y a.s. as n → ∞.

Then, by Kronecker's lemma, (1/n) Σ_{k=1}^n (Z_k − E[Z_k | F_{k−1}]) → 0 a.s. as n → ∞. □
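Lemma B.2 can likewise be checked numerically. In the sketch below (an arbitrary example, not from the paper), Z_k is ±1 and depends on the past through the sign of a running sum, so the sequence is adapted but not i.i.d.; nevertheless E[Z_k | F_{k−1}] = 0 because the underlying coin is fair, and the centered average should vanish:

```python
import random

def centered_average(n, seed=0):
    """(1/n) * sum of (Z_k - E[Z_k | F_{k-1}]) for a bounded adapted sequence."""
    rng = random.Random(seed)
    s, total = 0, 0.0
    for _ in range(n):
        coin = rng.choice([-1, 1])        # fair coin, independent of the past
        z = coin * (1 if s >= 0 else -1)  # Z_k depends on the past through s
        total += z                        # E[Z_k | F_{k-1}] = 0 here
        s += coin
    return total / n

print(abs(centered_average(200_000)))
```

The running average is on the order of 1/√n, consistent with the lemma's almost-sure convergence to zero.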
References
[1] Blackburn, J. (1936). Acquisition of Skill: An Analysis of Learning Curves.
IHRB Report, No. 73.
[2] Ellison, G. (1993). Learning, Local Interaction, and Coordination. Economet-
rica, 61: 1047-1072.
[3] Erev, I. and A. Roth. (1998). Predicting How People Play Games: Reinforce-
ment Learning in Experimental Games with Unique, Mixed Strategy Equilibria.
American Economic Review, 88: 848-881.
[4] Jehiel, P. and D. Samet. (2000). Learning to Play Games in Extensive Form by
Valuation. Working Paper.
[5] Kaelbling, L., M. Littman, and A. Moore. (1996). Reinforcement Learning: A
Survey. Journal of Artificial Intelligence Research, 4: 237-285.
[6] Kandori, M., G. Mailath, and R. Rob. (1993). Learning, Mutation, and Long
Run Equilibria in Games. Econometrica, 61: 29-56.
[7] Laslier, J-F. and B. Walliser. (2005). A Reinforcement Learning Process in Extensive Form Games. International Journal of Game Theory, 33: 219-227.
[8] Sutton, R. (ed.). (1992). Reinforcement Learning. Boston: Kluwer Academic
Publishers.
[9] Sutton, R. and A. Barto. (1998). Reinforcement Learning: An Introduction.
Cambridge: MIT Press.
[10] Thorndike, E. (1911). Animal Intelligence. New York: Macmillan.