
Asymptotically Optimal Strategies for Adaptive Zero-Sum Discounted Markov Games


SIAM J. CONTROL OPTIM. © 2009 Society for Industrial and Applied Mathematics
Vol. 48, No. 3, pp. 1405–1421

ASYMPTOTICALLY OPTIMAL STRATEGIES FOR ADAPTIVE ZERO-SUM DISCOUNTED MARKOV GAMES∗

J. ADOLFO MINJÁREZ-SOSA† AND OSCAR VEGA-AMAYA‡

Abstract. We consider a class of discrete-time two-person zero-sum Markov games with Borel state and action spaces, and possibly unbounded payoffs. The game evolves according to the recursive equation x_{n+1} = F(x_n, a_n, b_n, ξ_n), n = 0, 1, . . . , where the disturbance process {ξ_n} is formed by independent and identically distributed R^k-valued random vectors, which are observable but whose common density ρ is unknown to both players. Under certain continuity and compactness conditions, we combine a nonstationary iteration procedure and suitable density estimation methods to construct asymptotically discounted optimal strategies for both players.

Key words. zero-sum Markov games, discounted payoff, adaptive strategies, asymptotic optimality

AMS subject classifications. 91A15, 62G07

DOI. 10.1137/060651458

1. Introduction. We consider a discrete-time two-person zero-sum Markov game whose state process {x_n} evolves according to the equation

(1) x_{n+1} = F(x_n, a_n, b_n, ξ_n),  n = 0, 1, . . . ,

where (a_n, b_n) represents the actions chosen by players 1 and 2, respectively, at time n. The disturbance process {ξ_n} is an observable sequence of independent and identically distributed random vectors in R^k with density ρ, which is unknown to both players.

The evolution of the game is as follows. At each stage n, on the basis of the state x_n as well as the history of the game, the players, independently of each other, choose actions a_n and b_n. Then player 1 receives a payoff r(x_n, a_n, b_n) from player 2, and the game jumps to a new state x_{n+1} according to the transition law determined by (1). The payoffs are accumulated throughout the evolution of the game over an infinite horizon, and the players measure the performance of the strategies they use to choose actions by a total discounted payoff. However, since the density ρ is unknown to both players, they have to estimate ρ and then adapt their strategies to this estimate. Strategies that combine estimation and optimization procedures in this way are called adaptive strategies.

It is well known that the discounted payoff criterion depends strongly on the decisions selected at the first stages (precisely when the players' information about the density ρ is deficient), so neither player 1 nor player 2 can ensure, in general, the existence of discounted optimal strategies. Thus, the discounted optimality of an adaptive strategy will be analyzed in an asymptotic sense. The notion of asymptotic optimality used in the present paper was adapted from Schäl [20], who introduced this concept to study adaptive Markov control processes.

∗Received by the editors February 2, 2006; accepted for publication (in revised form) January 15, 2009; published electronically April 15, 2009. This work was partially supported by Consejo Nacional de Ciencia y Tecnología (CONACyT) under grant 46633-F.
http://www.siam.org/journals/sicon/48-3/65145.html
†Departamento de Matemáticas, Universidad de Sonora, Rosales s/n, Centro, C. P. 83000, Hermosillo, Sonora, México ([email protected]).
‡Corresponding author. Departamento de Matemáticas, Universidad de Sonora, Rosales s/n, Centro, C. P. 83000, Hermosillo, Sonora, México ([email protected]).




Roughly speaking, our approach is as follows. First, imposing mild conditions on the model, we show that the (nonadaptive) discounted game has a value as well as the existence of optimal strategies for player 2 and ε-optimal strategies for player 1 for each ε > 0. This is done by showing that the Shapley (or dynamic programming) operator is a contraction on a certain space of measurable functions; thus the Banach fixed point theorem yields the Shapley equation and the convergence of the value iteration algorithm (see Remark 5 and Theorem 6 in section 3). In a second step, we impose additional conditions on the function F in (1) and on the density ρ, which allow a combination of a suitable density estimation method for ρ with the value iteration scheme to approximate the value function. The sequences of minimizers coming from the value iteration scheme yield asymptotically discounted optimal strategies for the players (see Theorem 12 in section 4).

The adaptive Markov control problem has received considerable attention (see, for instance, [1, 6, 7, 8, 9, 11, 16, 17, 20] and the references therein), but for Markov games we know of only the papers by Najim, Poznyak, and Gómez [18] and Papavassilopoulos [19]. There are some important differences between the two latter papers and our work. The paper [18] deals with finite models and assumes that the dynamics of the players are "uncoupled," that is, each player controls his/her own state process, while [19] considers only discrete-time deterministic scalar LQ games.

The remainder of the paper is organized as follows. In section 2, the game model we deal with is introduced. Next, in section 3, the (nonadaptive) game is analyzed, and the results obtained there are used in section 4 to construct the adaptive strategies. In section 5, it is shown that a game model for reservoir operation satisfies all the assumptions of the main result (Theorem 12). Finally, the proofs are presented in section 6.

2. The game model. Notation. Throughout the paper we shall use the following notation. Given a Borel space S—that is, a Borel subset of a complete separable metric space—B(S) denotes its Borel σ-algebra, and "measurability" always means measurability with respect to B(S). The class of all probability measures on S is denoted by P(S). Given two Borel spaces S and S′, a stochastic kernel ϕ(·|·) on S given S′ is a function such that ϕ(·|s′) is in P(S) for each s′ ∈ S′, and ϕ(B|·) is a measurable function on S′ for each B ∈ B(S). Moreover, R_+ stands for the set of nonnegative real numbers, and N (N_0, resp.) denotes the set of positive (nonnegative, resp.) integers. Finally, L(S) stands for the class of lower semicontinuous functions on S that are bounded below.

The Markov game model. This paper is concerned with a zero-sum Markov game modeled by

(X, A, B, K_A, K_B, F, ρ, r),

where X is the state space, and the sets A and B are the control spaces for players 1 and 2, respectively. The constraint sets K_A and K_B are subsets of X × A and X × B, respectively. It is assumed that all these sets are Borel spaces. Thus, for each x ∈ X, the x-sections

A(x) := {a ∈ A : (x, a) ∈ K_A},
B(x) := {b ∈ B : (x, b) ∈ K_B}

stand for the sets of admissible actions or controls for players 1 and 2, respectively, and the set

K := {(x, a, b) : x ∈ X, a ∈ A(x), b ∈ B(x)}



of admissible state-action triplets is a Borel subset of the Cartesian product X × A × B.

The dynamics of the system is modeled by a measurable function

F : K × R^k → X

as in (1). We assume that the perturbation process {ξ_n} is formed by independent and identically distributed R^k-valued random vectors, for some fixed nonnegative integer k, with common density ρ(·), which is unknown to both players. Moreover, we suppose that the realizations ξ_0, ξ_1, . . . of the disturbance process and the states x_0, x_1, . . . are completely observable.

Finally, the payoff function r(·, ·, ·) is a measurable function on K.

Our concern is with a game played over an infinite horizon evolving as follows: at each time n ∈ N_0, the players observe the state of the system x_n = x ∈ X; next, possibly taking into account the history of the game, players 1 and 2 get estimates ρ^1_n and ρ^2_n of the unknown density ρ, respectively, and independently adapt their strategies to choose actions a = a_n(ρ^1_n) ∈ A(x) and b = b_n(ρ^2_n) ∈ B(x), respectively. As a consequence, player 2 pays the amount r(x, a, b) to player 1 and the system visits a new state x_{n+1} = x′ ∈ X according to the evolution law

Q(D|x, a, b) := ∫_{R^k} 1_D[F(x, a, b, s)] ρ(s) ds,  D ∈ B(X),

where 1_D(·) denotes the indicator function of the set D. The goal of player 1 (player 2, resp.) is to maximize (minimize, resp.) his/her reward flow (cost flow, resp.)

r(x_0, a_0, b_0), r(x_1, a_1, b_1), . . .

over an infinite horizon using a discounted expected reward (cost, resp.) criterion defined in section 3.
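Since the transition law Q is induced by the system function F and the disturbance density ρ, one transition of the game can be simulated by drawing a disturbance and applying F. The following minimal sketch (not from the paper) illustrates this; the one-dimensional dynamics F and the distribution standing in for ρ are hypothetical choices.

import numpy as np

rng = np.random.default_rng(0)

def F(x, a, b, s):
    # Hypothetical dynamics; the paper leaves F abstract except for measurability.
    return max(x + s - (a + b), 0.0)

def sample_disturbance():
    # Stand-in for a draw from the unknown density rho (here |N(0, 1)|).
    return abs(rng.normal())

def step(x, a, b):
    # Simulate one transition x -> x' distributed according to Q(.|x, a, b).
    s = sample_disturbance()   # xi_n ~ rho, observable by both players
    return F(x, a, b, s), s    # next state and the observed disturbance

x_next, xi = step(x=1.0, a=0.3, b=0.2)

Both players observe the realized disturbance ξ_n, which is what makes the density estimation schemes of section 4 possible.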

Strategies. Let H_0 := X and H_n := K × R^k × H_{n-1} for n ∈ N. Then, for each n ∈ N_0, a generic element of H_n is denoted by

h_n := (x_0, a_0, b_0, ξ_0, . . . , x_{n-1}, a_{n-1}, b_{n-1}, ξ_{n-1}, x_n),

which can be thought of as the history of the game up to the nth transition.

Thus, a strategy for player 1 is a sequence π^1 = {π^1_n} of stochastic kernels π^1_n on A given H_n satisfying the constraint

π^1_n(A(x_n)|h_n) = 1  ∀ h_n ∈ H_n, n ∈ N_0.

The class of all strategies for player 1 is denoted by Π^1.

For each x ∈ X, let A(x) := P(A(x)) and B(x) := P(B(x)). Denote by Φ^1 the class of all stochastic kernels ϕ^1 on A given X such that ϕ^1(·|x) ∈ A(x) for all x ∈ X. Similarly, Φ^2 is the class of all stochastic kernels ϕ^2 on B given X such that ϕ^2(·|x) ∈ B(x) for all x ∈ X.

A strategy π^1 ∈ Π^1 is called stationary if

π^1_n(·|h_n) = ϕ^1(·|x_n)  ∀ h_n ∈ H_n, n ∈ N_0,

for some stochastic kernel ϕ^1 in Φ^1. Following a standard convention, we identify Φ^1 with the class of stationary strategies for player 1. The sets Π^2 and Φ^2 of all strategies and all stationary strategies, respectively, for player 2 are defined in a similar way.



Let (Ω, F) be the measurable space consisting of the sample space Ω := (K × R^k)^∞ and its product σ-algebra F. Then, each pair of strategies (π^1, π^2) ∈ Π^1 × Π^2 and each initial state x_0 = x ∈ X induce in a canonical way a probability measure P^{π^1,π^2}_x on (Ω, F) which governs the evolution of the stochastic process {(x_n, a_n, b_n, ξ_n)}. The expectation operator with respect to P^{π^1,π^2}_x is denoted by E^{π^1,π^2}_x. The "marginal distribution" of the process {ξ_n} is denoted by P and the corresponding expectation operator by E.

Throughout the paper we shall use the following notation: for a measurable function u on K and a stationary strategy pair (ϕ^1, ϕ^2) ∈ Φ^1 × Φ^2, let

(2) u(x, ϕ^1, ϕ^2) := ∫_{B(x)} ∫_{A(x)} u(x, a, b) ϕ^1(da|x) ϕ^2(db|x)  ∀ x ∈ X.

Thus, in particular, for x ∈ X and s ∈ R^k, we shall write

r(x, ϕ^1, ϕ^2) := ∫_{B(x)} ∫_{A(x)} r(x, a, b) ϕ^1(da|x) ϕ^2(db|x),

u(F(x, ϕ^1, ϕ^2, s)) := ∫_{B(x)} ∫_{A(x)} u(F(x, a, b, s)) ϕ^1(da|x) ϕ^2(db|x).

We impose the following "compactness/continuity" conditions on the game model.

Assumption 1.
(a) The mapping x → A(x) is lower semicontinuous and A(x) is complete for every x ∈ X.
(b) The mapping x → B(x) is upper semicontinuous and B(x) is compact for every x ∈ X.
(c) The function r(·, ·, ·) ≥ 0 is lower semicontinuous on K.
(d) The mapping

(3) (x, a, b) → ∫_{R^k} v(F(x, a, b, s)) ρ(s) ds

is continuous on K for every bounded continuous function v on X.

Remark 2. Note that Assumption 1(d) implies that the mapping in (3) is lower semicontinuous whenever v(·) is in L(X).

We close this section by stating a "minimax" measurable selection theorem—borrowed from Kuenle [13]—as well as some of its direct consequences.

Theorem 3. Suppose Assumptions 1(a)–(b) hold. If w ∈ L(K), then

w^*(x) := inf_{ϕ^2∈B(x)} sup_{ϕ^1∈A(x)} w(x, ϕ^1, ϕ^2) = sup_{ϕ^1∈A(x)} inf_{ϕ^2∈B(x)} w(x, ϕ^1, ϕ^2),

and w^*(·) is in L(X). Moreover, there exists ϕ^2_* ∈ Φ^2 such that

w^*(x) = sup_{ϕ^1∈A(x)} w(x, ϕ^1, ϕ^2_*)  ∀ x ∈ X,

and for every ε > 0, there exists ϕ^1_ε ∈ Φ^1 such that

w^*(x) − ε ≤ inf_{ϕ^2∈B(x)} w(x, ϕ^1_ε, ϕ^2)  ∀ x ∈ X.



Thus, provided that Assumption 1 holds, Theorem 3 implies that the dynamic programming operator

(4) Tv(x) := inf_{ϕ^2∈B(x)} sup_{ϕ^1∈A(x)} [ r(x, ϕ^1, ϕ^2) + α ∫_{R^k} v(F(x, ϕ^1, ϕ^2, s)) ρ(s) ds ],  x ∈ X,

maps the space L(X) into itself for each discount factor α ∈ (0, 1). In addition, the interchange of inf and sup in (4) holds:

(5) Tv(x) = sup_{ϕ^1∈A(x)} inf_{ϕ^2∈B(x)} [ r(x, ϕ^1, ϕ^2) + α ∫_{R^k} v(F(x, ϕ^1, ϕ^2, s)) ρ(s) ds ]

for every state x ∈ X and every function v ∈ L(X). Moreover, for each ε > 0, there exist strategies ϕ^1_ε ∈ Φ^1 and ϕ^2_* ∈ Φ^2 such that

Tv(x) = sup_{ϕ^1∈A(x)} [ r(x, ϕ^1, ϕ^2_*) + α ∫_{R^k} v(F(x, ϕ^1, ϕ^2_*, s)) ρ(s) ds ]  ∀ x ∈ X,

Tv(x) − ε ≤ inf_{ϕ^2∈B(x)} [ r(x, ϕ^1_ε, ϕ^2) + α ∫_{R^k} v(F(x, ϕ^1_ε, ϕ^2, s)) ρ(s) ds ]  ∀ x ∈ X.
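On a finite discretization of the state and action sets, one application of the operator T in (4) reduces to computing, at each grid state, the value of a zero-sum matrix game with entries r(x, a, b) + α E v(F(x, a, b, ξ)). The sketch below is only an illustration and is not part of the paper: the grid, the dynamics F, the payoff r, and the Monte Carlo sample standing in for ρ are hypothetical, and each matrix game is solved by linear programming.

import numpy as np
from scipy.optimize import linprog

def matrix_game_value(G):
    # Value of the zero-sum matrix game G (row player maximizes) via linear programming.
    m, n = G.shape
    c = np.r_[np.zeros(m), -1.0]                   # variables (p_1..p_m, v); minimize -v
    A_ub = np.c_[-G.T, np.ones(n)]                 # v - sum_i G[i, j] p_i <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)   # probabilities sum to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

# Hypothetical finite approximation (stand-ins for X, A(x), B(x), F, r, and rho).
rng = np.random.default_rng(0)
X_grid = np.linspace(0.0, 5.0, 21)      # state grid
A_acts = np.array([0.0, 0.5, 1.0])      # actions of player 1
B_acts = np.array([0.2, 0.6])           # actions of player 2
alpha = 0.9
xi = np.abs(rng.normal(size=200))       # Monte Carlo sample standing in for rho

def F(x, a, b, s):
    return np.clip(x + s - (a + b), X_grid[0], X_grid[-1])

def r(x, a, b):
    return x + a - b

def T(v):
    # One application of the operator (4) on the grid.
    v_new = np.empty_like(v)
    for ix, x in enumerate(X_grid):
        G = np.empty((len(A_acts), len(B_acts)))
        for ia, a in enumerate(A_acts):
            for ib, b in enumerate(B_acts):
                nxt = F(x, a, b, xi)
                # Expected value of v at the next states, nearest-grid-point interpolation.
                Ev = v[np.abs(X_grid[:, None] - nxt[None, :]).argmin(axis=0)].mean()
                G[ia, ib] = r(x, a, b) + alpha * Ev
        v_new[ix] = matrix_game_value(G)
    return v_new

v = T(np.zeros(len(X_grid)))   # one value-iteration step starting from v = 0

By Remark 5 below, iterating T from any function in L_W(X) converges to the value function; the adaptive operators T^i_n of section 4 have the same structure with ρ replaced by the projected density estimates.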

3. Discounted optimal strategies. Throughout the paper the discount factor α ∈ (0, 1) is fixed. Thus, the expected α-discounted payoff for the pair of strategies (π^1, π^2) ∈ Π^1 × Π^2, given the initial state x_0 = x ∈ X, is defined by

V(π^1, π^2, x) := E^{π^1,π^2}_x Σ_{n=0}^{+∞} α^n r(x_n, a_n, b_n),  x ∈ X.

Then, the upper value and the lower value of the game at the initial state x ∈ X are given as

U(x) := inf_{π^2∈Π^2} sup_{π^1∈Π^1} V(π^1, π^2, x)  and  L(x) := sup_{π^1∈Π^1} inf_{π^2∈Π^2} V(π^1, π^2, x),

respectively. Note that, in general, U(·) ≥ L(·); thus, if U(·) = L(·) holds, the common function is called the value of the game and is denoted by V^*(·).

If the game has a value V^*(·), a strategy π^1_* ∈ Π^1 is said to be ε-optimal for player 1, for ε ≥ 0, if

V^*(·) − ε ≤ inf_{π^2∈Π^2} V(π^1_*, π^2, ·).

Similarly, a strategy π^2_* ∈ Π^2 is said to be ε-optimal for player 2 if

V^*(·) + ε ≥ sup_{π^1∈Π^1} V(π^1, π^2_*, ·).

We call optimal a strategy which is 0-optimal. A pair of optimal strategies (π^1_*, π^2_*) is called an optimal pair or saddle point. Note that (π^1_*, π^2_*) is an optimal pair if and only if

V(π^1, π^2_*, ·) ≤ V(π^1_*, π^2_*, ·) ≤ V(π^1_*, π^2, ·)  ∀ π^1 ∈ Π^1, π^2 ∈ Π^2.

We next introduce a growth condition which allows us to handle the unbounded payoff case.

Assumption 4. There exist a measurable function W ≥ 1 on X and positive constants β < 1, M, and b such that the following inequalities hold for all (x, a, b) ∈ K:



(a) 0 ≤ r(x, a, b) ≤ M W(x);
(b) ∫_{R^k} W(F(x, a, b, s)) ρ(s) ds ≤ β W(x) + b.

From Van Nunen and Wessels [23], we have that Assumption 4, together with Assumption 1, implies that the dynamic programming operator has a unique fixed point in a certain space of measurable functions. To state this fact precisely we first introduce some definitions and notation.

For each function u ∈ L(X) define the W-weighted norm, W-norm for short, as

||u||_W := sup_{x∈X} |u(x)| / W(x),

and denote by L_W(X) the class of functions in L(X) with finite W-norm. It is easy to check that L_W(X) is a Banach space. Now let θ ∈ (α, 1) and d := b[θ/α − 1]^{-1} be fixed constants, and define

W̄(·) := W(·) + d.

Now consider the functions u ∈ L(X) with finite W̄-norm, that is,

||u||_{W̄} = sup_{x∈X} |u(x)| / W̄(x) < ∞,

and observe that this norm is equivalent to the W-norm since

||u||_{W̄} ≤ ||u||_W ≤ (1 + d) ||u||_{W̄}.

Thus, in particular, u is in L_W(X) if and only if it has finite W̄-norm. Moreover, direct computations yield

α ∫_{R^k} W̄(F(x, a, b, s)) ρ(s) ds ≤ θ W̄(x)  ∀ (x, a, b) ∈ K,

which implies that T is a contraction from L_W(X) into itself with modulus θ with respect to the W̄-norm.
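For completeness, the "direct computations" behind the last display can be spelled out as follows; this short verification is added as a reading aid and is not quoted from the paper. By Assumption 4(b),

α ∫_{R^k} W̄(F(x, a, b, s)) ρ(s) ds = α ∫_{R^k} W(F(x, a, b, s)) ρ(s) ds + αd ≤ αβ W(x) + αb + αd,

and since αβ < α < θ while αb + αd = θd by the choice d = b[θ/α − 1]^{-1}, the right-hand side is at most θ W(x) + θd = θ W̄(x).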

All these facts are summarized in the following remark.

Remark 5 (cf. Van Nunen and Wessels [23]). Suppose that Assumptions 1 and 4 hold, and let θ ∈ (α, 1) and d := b[θ/α − 1]^{-1} be fixed constants. Then the following hold:
(a) T is a contraction operator from (L_W(X), ||·||_{W̄}) into itself with modulus θ.
(b) Thus, by the Banach fixed point theorem, T has a unique fixed point V^* ∈ L_W(X), i.e.,

(6) TV^* = V^*

and

||T^n u − V^*||_{W̄} → 0  ∀ u ∈ L_W(X).

(c) Hence, since ||·||_W and ||·||_{W̄} are equivalent norms on L_W(X),

||T^n u − V^*||_W → 0  ∀ u ∈ L_W(X).

The next theorem shows the existence of optimal strategies for the nonadaptive discounted stochastic game. Similar results have been previously obtained under



analogous but different settings (see, for instance, [5, 12, 15]). Its proof follows from Theorem 3, Remark 5, and standard dynamic programming arguments.

Theorem 6. Suppose that Assumptions 1 and 4 hold. Then we have the following:
(a) The function V^* in Remark 5 is the value of the game.
(b) There exists a stationary strategy ϕ^2_* ∈ Φ^2 such that

V^*(x) = sup_{ϕ^1∈A(x)} [ r(x, ϕ^1, ϕ^2_*) + α ∫_{R^k} V^*[F(x, ϕ^1, ϕ^2_*, s)] ρ(s) ds ]  ∀ x ∈ X;

moreover, for each ε > 0 there exists a strategy ϕ^1_ε ∈ Φ^1 such that

V^*(x) − ε ≤ inf_{ϕ^2∈B(x)} [ r(x, ϕ^1_ε, ϕ^2) + α ∫_{R^k} V^*[F(x, ϕ^1_ε, ϕ^2, s)] ρ(s) ds ]  ∀ x ∈ X.

(c) Thus, ϕ^1_ε is ε-optimal for player 1 and ϕ^2_* is optimal for player 2, that is,

V^*(·) − ε ≤ inf_{π^2∈Π^2} V(ϕ^1_ε, π^2, ·),

V^*(·) = sup_{π^1∈Π^1} V(π^1, ϕ^2_*, ·).

4. Adaptive strategies. As we already mentioned in the introduction, our main concern is with the case when neither player knows the density ρ(·), so the solution to the game given in Theorem 6 is not accessible to them. In fact, they have to use adaptive strategies; that is, they need to combine statistical density estimation procedures with the decision process to gain some insight into the evolution of the game. Obviously, as the discounted criterion depends strongly on the decisions selected at the first stages (precisely when the information about the density ρ is deficient), such procedures yield, in the best case, suboptimal strategies in the long term. Thus, a weaker optimality criterion is required.

In the present paper, we shall use an asymptotic optimality criterion; specifically, we adapt to stochastic games the notion of asymptotic optimality introduced by Schäl in [20] (see also [8]) to study adaptive Markov control processes under the discounted criterion. To do this, define the discrepancy function as

D(x, a, b) := r(x, a, b) + α ∫_{R^k} V^*[F(x, a, b, s)] ρ(s) ds − V^*(x)

for all (x, a, b) ∈ K. From (4), (5), and Theorem 6(a), the relation (6) is equivalent to

sup_{ϕ^1∈A(x)} inf_{ϕ^2∈B(x)} D(x, ϕ^1, ϕ^2) = inf_{ϕ^2∈B(x)} sup_{ϕ^1∈A(x)} D(x, ϕ^1, ϕ^2) = 0.

Moreover, from Theorem 6(b), the pair of strategies (ϕ^1_ε, ϕ^2_*) satisfies, for all x ∈ X,

(7) D(x, ϕ^1_ε, ϕ^2) ≥ −ε  ∀ ϕ^2 ∈ B(x),
    D(x, ϕ^1, ϕ^2_*) ≤ 0  ∀ ϕ^1 ∈ A(x).

These facts motivate the following definition.



Definition 7. A strategy π^1_* ∈ Π^1 is said to be asymptotically discounted (AD)-optimal for player 1 if

lim inf_{n→∞} E^{π^1_*,π^2}_x D(x_n, a_n, b_n) ≥ 0  ∀ x ∈ X, π^2 ∈ Π^2.

Similarly, π^2_* ∈ Π^2 is said to be AD-optimal for player 2 if

lim sup_{n→∞} E^{π^1,π^2_*}_x D(x_n, a_n, b_n) ≤ 0  ∀ x ∈ X, π^1 ∈ Π^1.

A pair of AD-optimal strategies is called an AD-optimal pair. In this case, if (π^1_*, π^2_*) is an AD-optimal pair, then, for each x ∈ X,

lim_{n→∞} E^{π^1_*,π^2_*}_x D(x_n, a_n, b_n) = 0.

In order to show the existence of adaptive strategies for both players we need to impose some conditions on the density ρ(·) and on the dynamics of the game (1).

Assumption 8. There exists a measurable function ρ̄ on R^k such that the density ρ(·) satisfies the following conditions:
(a) ρ(·) ≤ ρ̄(·) a.e. with respect to the Lebesgue measure.
(b) For each s ∈ R^k, F(·, ·, ·, s) is continuous.
(c) Moreover, there exist constants λ_0 ∈ (0, 1), b_0 ≥ 0, p > 1, and M > 0 such that for all (x, a, b) ∈ K it holds that

0 ≤ r(x, a, b) ≤ M W(x),   ∫_{R^k} W^p[F(x, a, b, s)] ρ(s) ds ≤ λ_0 W^p(x) + b_0.

Remark 9.
(a) Assumption 8(b) implies that the mapping

(8) (x, a, b) → ∫_{R^k} v(F(x, a, b, s)) μ(ds)

is continuous for each bounded continuous function v on X and each probability measure μ(·) on R^k (see, for instance, [9]). Thus, if v(·) belongs to L(X), then the mapping (8) is in L(K).
(b) By Jensen's inequality, Assumption 8(c) implies Assumption 4(b) with β := λ_0^{1/p} and b := b_0^{1/p}, that is,

(9) ∫_{R^k} W(F(x, a, b, s)) ρ(s) ds ≤ β W(x) + b.

Then, we have that Assumptions 8(b)–(c) imply Assumption 4.
(c) Additionally, note that Assumption 8(c) yields

(10) sup_{n≥0} E^{π^1,π^2}_x [W^p(x_n)] < ∞  and  sup_{n≥0} E^{π^1,π^2}_x [W(x_n)] < ∞

for each pair (π^1, π^2) ∈ Π^1 × Π^2 and x ∈ X (see, e.g., [7, 10]).



Density estimation. We show the existence of estimators of ρ(·) with suitable properties for the construction of adaptive strategies. To state this precisely, let ξ_0, ξ_1, . . . , ξ_n, . . . be independent realizations of random vectors with density ρ(·), and let ρ^i_n(s) := ρ^i_n(s; ξ_0, ξ_1, . . . , ξ_{n-1}), s ∈ R^k, n ∈ N, be density estimators for players i = 1 and 2 such that

(11) E||ρ^i_n − ρ||_1 = E ∫_{R^k} |ρ^i_n(s) − ρ(s)| ds → 0  as n → ∞.

A wide class of estimators satisfying this condition is the class of kernel estimates

ρ^i_n(s) = (1/(n h_n^k)) Σ_{j=1}^{n} K((s − ξ_j)/h_n),

where the kernel K is a nonnegative measurable function with ∫_{R^k} K ds = 1, h_n → 0, and n h_n^k → ∞ as n goes to infinity (see, e.g., [4, Chapter 9, Theorem 9.2]).

Now, we consider the class D formed by the densities σ(·) satisfying the following conditions:
D.1. σ(·) ≤ ρ̄(·) a.e. with respect to the Lebesgue measure;
D.2. ∫_{R^k} W[F(x, a, b, s)] σ(s) ds ≤ β W(x) + b for all (x, a, b) ∈ K, where the constants β and b are as in Remark 9(b).

The class of densities D is a closed convex subset of L_1 under Assumption 10 (see [7]).

Assumption 10. The function

(12) ψ(s) := sup_{x∈X} sup_{a∈A(x)} sup_{b∈B(x)} (1/W(x)) W[F(x, a, b, s)],  s ∈ R^k,

is finite and satisfies the condition

(13) ∫_{R^k} ψ^2(s) ρ̄(s) ds < ∞.

Note that the function ψ in (12) is upper semianalytic and thus universally measurable. Hence, the integral in (13) is well defined (see [3, Chapter 7, section 7.6]).

Then, from [4, Exercise 15.4, p. 169], for each n ∈ N there exists ρ̂^i_n(·) ∈ D such that

(14) ||ρ̂^i_n − ρ^i_n||_1 = inf_{σ∈D} ||σ − ρ^i_n||_1,  i = 1, 2.

Notice that E||ρ̂^i_n − ρ||_1 → 0 since

||ρ̂^i_n − ρ||_1 ≤ ||ρ̂^i_n − ρ^i_n||_1 + ||ρ^i_n − ρ||_1 ≤ 2||ρ^i_n − ρ||_1  ∀ n ∈ N.
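As an illustration only (not from the paper), the kernel estimate displayed above and its L_1 distance to the true density can be computed numerically as follows; the Gaussian kernel, the bandwidth rule, and the "true" density are hypothetical choices for the case k = 1.

import numpy as np

rng = np.random.default_rng(0)

def kernel_density_estimate(xi, grid, h):
    # Kernel estimate rho_n(s) = (1/(n h)) * sum_j K((s - xi_j)/h) for k = 1.
    K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel
    return K((grid[:, None] - xi[None, :]) / h).mean(axis=1) / h

n = 500
xi = rng.exponential(scale=1.0, size=n)     # observed disturbances (hypothetical rho)
h = n ** (-1.0 / 5.0)                       # bandwidth with h_n -> 0 and n*h_n -> infinity
grid = np.linspace(0.0, 10.0, 2001)

rho_hat = kernel_density_estimate(xi, grid, h)
rho_true = np.exp(-grid)                    # density of the hypothetical rho

l1_distance = np.trapz(np.abs(rho_hat - rho_true), grid)   # approximates the L1 norm in (11)

The projection step (14) onto the class D is an additional constrained minimization in L_1 and is not reproduced here.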

The densities {ρ̂^i_n(·)} will be used to obtain adaptive strategies for each player, but we first express their convergence property using a more suitable "norm," defined as follows. For a measurable function σ : R^k → R define

||σ|| := sup_{x∈X} sup_{a∈A(x), b∈B(x)} (1/W(x)) ∫_{R^k} W[F(x, a, b, s)] |σ(s)| ds.

Note that condition D.2 guarantees that ||σ|| < ∞ for any density σ in D.

The proof of the next result is given in section 6.

Lemma 11. If Assumptions 8 and 10 hold, then

E||ρ̂^i_n − ρ|| → 0  as n → ∞  for i = 1, 2.



Construction of strategies. The adaptive strategies will be obtained using a nonstationary value iteration approach in which the players use the density estimates {ρ̂^i_n} in place of the true density, so we introduce, for i = 1, 2, the operators

T^i_n v(x) := inf_{ϕ^2∈B(x)} sup_{ϕ^1∈A(x)} [ r(x, ϕ^1, ϕ^2) + α ∫_{R^k} v(F(x, ϕ^1, ϕ^2, s)) ρ̂^i_n(s) ds ]

for x ∈ X, n ∈ N. Note that these operators are contractions from (L_W(X), ||·||_{W̄}) into itself with modulus θ—see condition D.2 and Remark 5(a)—provided that Assumptions 1(a)–(c) and 8 hold. Thus, in this case, from Theorem 3, the "minimax" value iteration functions for player i (= 1, 2),

U^i_n := T^i_n U^i_{n-1},  n ∈ N,  U^i_0 ≡ 0,

belong to the space L_W(X). Moreover, there exists a sequence of stationary strategies {λ^*_n} ⊂ Φ^2 for player 2 such that

(15) U^2_n(x) = sup_{ϕ^1∈A(x)} [ r(x, ϕ^1, λ^*_n) + α ∫_{R^k} U^2_{n-1}(F(x, ϕ^1, λ^*_n, s)) ρ̂^2_n(s) ds ]

for all x ∈ X, n ∈ N. Similarly, for any sequence of positive numbers {ε_n} converging to zero there exists a sequence of stationary strategies {μ^*_n} ⊂ Φ^1 such that

(16) U^1_n(x) − ε_n ≤ inf_{ϕ^2∈B(x)} [ r(x, μ^*_n, ϕ^2) + α ∫_{R^k} U^1_{n-1}(F(x, μ^*_n, ϕ^2, s)) ρ̂^1_n(s) ds ]

for all x ∈ X, n ∈ N. Furthermore, a straightforward calculation shows that, for some positive constants C_1 and C_2,

(17) |U^i_n(x)| ≤ C_i W(x)  ∀ n ∈ N_0, x ∈ X, i = 1, 2.

Finally, we have the main result, which is proved in section 6.

Theorem 12. Suppose that Assumptions 1(a)–(c), 8, and 10 hold. Then the strategies γ^1 = {μ^*_n} and γ^2 = {λ^*_n} are AD-optimal for players 1 and 2, respectively. Thus, in particular,

lim_{n→∞} E^{γ^1,γ^2}_x D(x_n, a_n, b_n) = 0  ∀ x ∈ X.

5. A game model for reservoir operation. The optimal operation of reservoir systems has been widely studied for a long time, but its analysis has turned out to be an exceptional task because of the complex relationships among the economic, social, hydrological, and other factors involved (see, e.g., [14, 22, 24, 25]). A reservoir system usually has two or more purposes (such as water supply for irrigation, domestic and industrial use, hydropower generation, water quality improvement, flood control, etc.) which in some cases may be in conflict. For instance, the decision maker might face the problem of allocating the water provision to meet the demand for domestic use on one hand and the demand for some economic activity on the other (say, hydropower generation or land irrigation). The conflicting situation in this case may come from the asymmetry of the revenues associated with these purposes: the economic activity is much more profitable, economically speaking of course, than the domestic use of the water, but the unsatisfied demand for the latter has a higher social impact than for the former.



Usually, the conflicting situations are modeled as optimal control problems with constraints to hedge the system against unsatisfied demand for one or several of the purposes [24, 25]. However, such a formulation has some disadvantages. On one hand, there may be some temptation to deviate from the optimal solution, since it prescribes the water release and allocation that optimize the overall system performance, not the optimal release for each purpose. On the other hand, it supposes a well-defined priority between the purposes, which in general is very difficult or impossible to establish [21].

A game model formulation provides an alternative way to overcome the weakness of the control problem described above. Here we study a single reservoir system with infinite capacity and two purposes, modeled as a zero-sum game in the sense that the water used for one purpose can be considered as water lost for the other.

Thus, we model the reservoir as a zero-sum game in the following way. The inflows happen at nonnegative random times T_n, n ∈ N_0, with T_0 := 0. Let ξ_{1,n} be the inflow at time T_n, n ∈ N, and assume it is a nonnegative random variable. At each time T_n the decision maker observes the stored water volume x_n and chooses the consumption rate a_n ∈ A := [0, a^*] for purpose 1 and the consumption rate b_n ∈ B := [b_*, b^*] for purpose 2, where a^*, b_*, and b^* are fixed positive constants. These consumption rates remain fixed until the next inflow time T_{n+1} occurs, provided the water available at the beginning of the period has not been depleted. In that case, the total withdrawal during the period (T_n, T_{n+1}] is (a_n + b_n)ξ_{2,n}, where ξ_{2,n} := T_{n+1} − T_n. When the storage process reaches the zero volume it stays there until a positive inflow arrives. Then, the storage process {x_n} evolves on the nonnegative real number set X = [0, ∞) according to the recursive equation

x_{n+1} = [x_n + ξ_{1,n} − (a_n + b_n)ξ_{2,n}]^+,  n ∈ N_0,

where x_0 = x ∈ X is the initial water volume. Thus, we have that

(18) F(x, a, b, s) = [x + s_1 − (a + b)s_2]^+

for (x, a, b) ∈ K and s = (s_1, s_2) ∈ R^2_+ := {(s_1, s_2) : s_1 ≥ 0, s_2 ≥ 0}. Notice that Assumption 8(b) holds since the function (18) is continuous in (x, a, b).
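A minimal simulation of the storage recursion, purely for illustration: the inflow and interarrival distributions and the fixed consumption rates below are hypothetical choices, not data from the paper.

import numpy as np

rng = np.random.default_rng(1)

def F(x, a, b, s1, s2):
    # Storage dynamics (18): volume after an inflow s1 over a period of length s2.
    return max(x + s1 - (a + b) * s2, 0.0)

a, b = 0.4, 0.8          # consumption rates, a in [0, a^*], b in [b_*, b^*]
x = 2.0                  # initial stored volume x_0
path = [x]
for n in range(100):
    s1 = rng.exponential(scale=1.0)   # xi_{1,n}: inflow at time T_n
    s2 = rng.exponential(scale=1.5)   # xi_{2,n}: length of the period (T_n, T_{n+1}]
    x = F(x, a, b, s1, s2)
    path.append(x)

With these choices Eξ_{1,n} = 1.0 < 0.8 · 1.5 = b Eξ_{2,n}, in the spirit of the stability condition (21) stated below.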

satisfies Assumptions 8(a) with

ρ(s1, s2) := exp[l(s1 + s2)], (s1, s2) ∈ B,(19)ρ(s1, s2) := 0 otherwise,(20)

where l is a large enough positive constant and B is a measurable bounded subset ofRk. We also assume that mean values of the inflow and the interarrival times satisfythe inequality

(21) Eξ1,n < b∗Eξ2,n.

Then, for q ≥ 0 define the function

Φ(q) := E exp[q(ξ_{1,n} − b_* ξ_{2,n})] = ∫∫_{R^2_+} exp[q(s_1 − b_* s_2)] ρ(s_1, s_2) ds_1 ds_2



and observe that the derivative Φ′(0) = Eξ_{1,n} − b_* Eξ_{2,n} is negative because of condition (21). Thus, since Φ(0) = 1, there exists q_0 ∈ (0, 1) such that λ_0 := Φ(q_0) < 1. Next, take the weight function as

W(x) := b_0 exp(q_0 x/2),  x ∈ X = [0, ∞),

where b_0 is a positive constant, and the one-stage payoff r as any lower semicontinuous function satisfying the condition

0 ≤ r(x, a, b) ≤ M W(x)  ∀ (x, a, b) ∈ K

for some positive constant M as in Assumption 8(c). Then, direct computations yield

∫∫_{R^2_+} W^2[F(x, a, b, s_1, s_2)] ρ(s_1, s_2) ds_1 ds_2 ≤ λ_0 W^2(x) + b_0^2  ∀ (x, a, b) ∈ K,

which verifies that the second inequality in Assumption 8(c) holds with p = 2.
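As a purely illustrative numerical check (not from the paper), Φ(q) can be estimated by Monte Carlo for hypothetical inflow and interarrival distributions satisfying (21), and a q_0 with Φ(q_0) < 1 located by inspection.

import numpy as np

rng = np.random.default_rng(2)

b_star = 0.8
xi1 = rng.exponential(scale=1.0, size=200_000)   # inflows, E xi_1 = 1.0
xi2 = rng.exponential(scale=1.5, size=200_000)   # interarrival times, E xi_2 = 1.5

def Phi(q):
    # Monte Carlo estimate of Phi(q) = E exp[q (xi_1 - b_* xi_2)].
    return np.exp(q * (xi1 - b_star * xi2)).mean()

# Phi(0) = 1 and Phi'(0) = E xi_1 - b_* E xi_2 < 0, so Phi dips below 1 for small q > 0.
for q in np.linspace(0.05, 0.5, 10):
    print(f"q = {q:.2f}, Phi(q) ~ {Phi(q):.3f}")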

To verify Assumption 10, note that

ψ(s_1, s_2) = sup_{x∈X} sup_{a∈A} sup_{b∈B} (1/W(x)) W[F(x, a, b, s_1, s_2)] ≤ L exp[q_0(s_1 − b_* s_2)/2] ≤ L exp(q_0 s_1 /2) < ∞

for all (s_1, s_2) ∈ R^2_+, where L is a suitable positive constant. Then,

∫∫_{R^2_+} ψ^2(s_1, s_2) ρ̄(s_1, s_2) ds_1 ds_2 ≤ ∫∫_{R^2_+} L^2 exp(q_0 s_1) ρ̄(s_1, s_2) ds_1 ds_2 < ∞,

since the function in the second integral is continuous and vanishes outside the bounded measurable set B.

Finally, note that Assumptions 1(a)–(b) trivially hold.

6. Proofs. For the proof of Theorem 12, in addition to Lemma 11, we need several other preliminary results, which are collected in Lemmas 13–16. Throughout this section we suppose that all assumptions in Theorem 12 hold. Thus, we begin with the proof of Lemma 11.

Proof of Lemma 11. From the Cauchy–Schwarz inequality, (12), and (13),

||ρ̂^i_n − ρ|| ≤ ∫_{R^k} ψ(s) |ρ̂^i_n(s) − ρ(s)| ds
 ≤ ( ∫_{R^k} ψ^2(s) |ρ̂^i_n(s) − ρ(s)| ds )^{1/2} ( ∫_{R^k} |ρ̂^i_n(s) − ρ(s)| ds )^{1/2}
 ≤ 2^{1/2} ( ∫_{R^k} ψ^2(s) ρ̄(s) ds )^{1/2} ( ∫_{R^k} |ρ̂^i_n(s) − ρ(s)| ds )^{1/2}
(22) ≤ C′ ( ∫_{R^k} |ρ̂^i_n(s) − ρ(s)| ds )^{1/2} = C′ ||ρ̂^i_n − ρ||_1^{1/2},

where C′ := (2 ∫_{R^k} ψ^2(s) ρ̄(s) ds)^{1/2}. The result now follows from the latter inequality and Jensen's inequality, since E||ρ̂^i_n − ρ||_1 → 0 as n → ∞.



Lemma 13. For each i = 1, 2,

(23) ||T^i_n V^* − TV^*||_W ≤ α ||V^*||_W ||ρ̂^i_n − ρ||  ∀ n ∈ N.

Thus, for each (π^1, π^2) ∈ Π^1 × Π^2 and x ∈ X,

(24) lim_{n→∞} E^{π^1,π^2}_x ||T^i_n V^* − TV^*||_W = 0.

Proof. Note that the second statement follows from the first one and Lemma 11, after observing that ||ρ̂^i_n − ρ|| does not depend on (π^1, π^2) ∈ Π^1 × Π^2 or on x ∈ X. Thus we proceed to prove (23). To do this, note that for all x ∈ X, n ∈ N, and i = 1, 2, we have

|T^i_n V^*(x) − TV^*(x)|
 ≤ sup_{ϕ^2∈B(x)} sup_{ϕ^1∈A(x)} | α ∫_{R^k} V^*(F(x, ϕ^1, ϕ^2, s)) ρ̂^i_n(s) ds − α ∫_{R^k} V^*(F(x, ϕ^1, ϕ^2, s)) ρ(s) ds |
 ≤ sup_{ϕ^2∈B(x)} sup_{ϕ^1∈A(x)} α ∫_{R^k} |V^*(F(x, ϕ^1, ϕ^2, s))| |ρ̂^i_n(s) − ρ(s)| ds
 ≤ α ||V^*||_W sup_{ϕ^2∈B(x)} sup_{ϕ^1∈A(x)} ∫_{R^k} W(F(x, ϕ^1, ϕ^2, s)) |ρ̂^i_n(s) − ρ(s)| ds
 ≤ α ||V^*||_W ||ρ̂^i_n − ρ|| W(x).

Hence, (23) holds.

Lemma 14. The following holds:

lim_{n→∞} E^{π^1,π^2}_x ||U^i_n − V^*||_W = 0

for each i = 1, 2, (π^1, π^2) ∈ Π^1 × Π^2, and x ∈ X.

Proof. First note that for all n ∈ N and i = 1, 2,

||U^i_n − V^*||_{W̄} = ||T^i_n U^i_{n-1} − TV^*||_{W̄} ≤ ||T^i_n U^i_{n-1} − T^i_n V^*||_{W̄} + ||T^i_n V^* − TV^*||_{W̄}.

Then, since T^i_n is a contraction operator with modulus θ with respect to the W̄-weighted norm, it follows from Lemma 13 and the inequality ||·||_{W̄} ≤ ||·||_W that

(25) ||U^i_n − V^*||_{W̄} ≤ θ ||U^i_{n-1} − V^*||_{W̄} + α ||V^*||_W ||ρ̂^i_n − ρ||.

Moreover, since E||ρ̂^i_n − ρ|| → 0, there exists a positive constant M′ such that

E^{π^1,π^2}_x ||U^i_n − V^*||_{W̄} ≤ θ E^{π^1,π^2}_x ||U^i_{n-1} − V^*||_{W̄} + M′.



Thus, by iterating this inequality we obtain

E^{π^1,π^2}_x ||U^i_n − V^*||_{W̄} ≤ θ^n ||V^*||_{W̄} + M′ Σ_{k=0}^{n-1} θ^k ≤ (M′ + ||V^*||_W)/(1 − θ),

which in turn implies that

L := lim sup_{n→∞} E^{π^1,π^2}_x ||U^i_n − V^*||_{W̄} < ∞.

Now taking expectations in (25) and then lim sup as n tends to infinity, we have that 0 ≤ L ≤ θL, which yields L = 0; since the norms ||·||_W and ||·||_{W̄} are equivalent, this proves the desired result.

In the following we shall use the "approximate discrepancy functions" for player i (= 1, 2) defined as

D^i_n(x, ϕ^1, ϕ^2) := r(x, ϕ^1, ϕ^2) + α ∫_{R^k} U^i_{n-1}(F(x, ϕ^1, ϕ^2, s)) ρ̂^i_n(s) ds − U^i_n(x)

for x ∈ X, ϕ^1 ∈ A(x), ϕ^2 ∈ B(x).

Lemma 15. For all x ∈ X and n ∈ N, it holds that

sup_{ϕ^1∈A(x)} sup_{ϕ^2∈B(x)} |D(x, ϕ^1, ϕ^2) − D^i_n(x, ϕ^1, ϕ^2)| ≤ W(x) η^i_n,

where

(26) η^i_n := ||U^i_n − V^*||_W + (β + b) ||U^i_{n-1} − V^*||_W + α ||V^*||_W ||ρ̂^i_n − ρ||

for all n ∈ N.

Proof. Let x ∈ X, ϕ^1 ∈ A(x), ϕ^2 ∈ B(x), and n ∈ N be fixed but arbitrary, and write R^i(x, ϕ^1, ϕ^2) := |D(x, ϕ^1, ϕ^2) − D^i_n(x, ϕ^1, ϕ^2)|. Then, observe that

R^i(x, ϕ^1, ϕ^2) ≤ |U^i_n(x) − V^*(x)|
 + α | ∫_{R^k} U^i_{n-1}(F(x, ϕ^1, ϕ^2, s)) ρ̂^i_n(s) ds − ∫_{R^k} V^*(F(x, ϕ^1, ϕ^2, s)) ρ(s) ds |
 ≤ |U^i_n(x) − V^*(x)| + α ∫_{R^k} V^*(F(x, ϕ^1, ϕ^2, s)) |ρ(s) − ρ̂^i_n(s)| ds
 + α ∫_{R^k} |U^i_{n-1}(F(x, ϕ^1, ϕ^2, s)) − V^*(F(x, ϕ^1, ϕ^2, s))| ρ̂^i_n(s) ds
 ≤ |U^i_n(x) − V^*(x)| + α ||V^*||_W ||ρ̂^i_n − ρ|| W(x)
 + ||U^i_{n-1} − V^*||_W ∫_{R^k} W(F(x, ϕ^1, ϕ^2, s)) ρ̂^i_n(s) ds.



Now, since ρ̂^i_n(·) ∈ D, from condition D.2 we have

∫_{R^k} W[F(x, a, b, s)] ρ̂^i_n(s) ds ≤ β W(x) + b ≤ [β + b] W(x),

which implies

R^i(x, ϕ^1, ϕ^2) ≤ |U^i_n(x) − V^*(x)| + α ||V^*||_W ||ρ̂^i_n − ρ|| W(x) + ||U^i_{n-1} − V^*||_W (β + b) W(x).

Hence,

sup_{ϕ^1∈A(x)} sup_{ϕ^2∈B(x)} |D(x, ϕ^1, ϕ^2) − D^i_n(x, ϕ^1, ϕ^2)| ≤ W(x) η^i_n  ∀ x ∈ X, n ∈ N.

Lemma 16. Let the strategies γ^1 = {μ^*_n} and γ^2 = {λ^*_n} be as in (16) and (15). Then

−ε_n − W(x_n) η^1_n ≤ D(x_n, μ^*_n, ϕ^2)  ∀ ϕ^2 ∈ B(x_n), n ∈ N,
D(x_n, ϕ^1, λ^*_n) ≤ W(x_n) η^2_n  ∀ ϕ^1 ∈ A(x_n), n ∈ N.

Proof. The inequalities follow directly from Lemma 15, noting that

inf_{ϕ^2∈B(x)} D^1_n(x, μ^*_n, ϕ^2) ≥ −ε_n,
sup_{ϕ^1∈A(x)} D^2_n(x, ϕ^1, λ^*_n) = 0

hold for all x ∈ X, n ∈ N.

Remark 17. Observe that from (26), and from Lemmas 11 and 14,

lim_{n→∞} E^{π^1,π^2}_x η^i_n = 0  for i = 1, 2, (π^1, π^2) ∈ Π^1 × Π^2, and x ∈ X.

Thus, η^i_n → 0 in P^{π^1,π^2}_x-probability for i = 1, 2.

Moreover, since ||σ|| < ∞ for σ in D, from (17) we have

k_i := sup_n η^i_n < ∞,  i = 1, 2.

Finally, we are ready to prove Theorem 12.

Proof of Theorem 12. From Lemma 16 it is enough to prove, for i = 1, 2, that

E^{π^1,π^2}_x W(x_n) η^i_n → 0  ∀ x ∈ X, π^1 ∈ Π^1, π^2 ∈ Π^2.



To prove this fact, we begin by proving that the process {W(x_n)η^i_n} converges to zero in probability with respect to P^{π^1,π^2}_x for all x ∈ X, π^1 ∈ Π^1, π^2 ∈ Π^2. To do this, let l_1 and l_2 be arbitrary positive constants; then observe that

P^{π^1,π^2}_x [W(x_n)η^i_n > l_1] ≤ P^{π^1,π^2}_x [η^i_n > l_1/l_2] + P^{π^1,π^2}_x [W(x_n) > l_2]  ∀ x ∈ X, n ∈ N.

Thus, Chebyshev's inequality and (10) yield

P^{π^1,π^2}_x [W(x_n)η^i_n > l_1] ≤ P^{π^1,π^2}_x [η^i_n > l_1/l_2] + (1/l_2) E^{π^1,π^2}_x W(x_n) ≤ P^{π^1,π^2}_x [η^i_n > l_1/l_2] + M/l_2

for some constant M < ∞. Hence, from Remark 17,

lim sup_{n→∞} P^{π^1,π^2}_x [W(x_n)η^i_n > l_1] ≤ M/l_2.

Since l_2 is arbitrary, we have that

lim_{n→∞} P^{π^1,π^2}_x [W(x_n)η^i_n > l_1] = 0,

which proves the claim.

On the other hand, from Remarks 9(c) and 17, we see that the inequality

sup_{n∈N} E^{π^1,π^2}_x [W(x_n)η^i_n]^p ≤ k_i^p sup_{n∈N} E^{π^1,π^2}_x W^p(x_n) < ∞

holds for all x ∈ X, π^1 ∈ Π^1, π^2 ∈ Π^2. Thus, from [2, Lemma 7.6.9, p. 301], the latter inequality implies that the process {W(x_n)η^i_n} is P^{π^1,π^2}_x-uniformly integrable.

Finally, using the uniform integrability of the process {W(x_n)η^i_n} and the fact that it converges to zero in probability, we conclude that

E^{π^1,π^2}_x W(x_n)η^i_n → 0.

Acknowledgments. The authors are grateful to the reviewers for many useful comments on the paper. In particular, they thank the reviewers for pointing out several mistakes, for providing [19], and for comments on the measurability of the function in Assumption 10. They also thank the reviewer who brought Devroye and Lugosi's book [4] to their attention.

REFERENCES

[1] E. Altman and A. Shwartz, Adaptive control of constrained Markov chains: Criteria and policies, Ann. Oper. Res., 28 (1991), pp. 101–134.
[2] R. B. Ash, Real Analysis and Probability, Academic Press, New York, 1972.
[3] D. P. Bertsekas and S. E. Shreve, Stochastic Optimal Control: The Discrete Time Case, Academic Press, New York, 1978.
[4] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation, Springer-Verlag, New York, 2001.
[5] J. I. González-Trejo, O. Hernández-Lerma, and L. F. Hoyos-Reyes, Minimax control of discrete-time stochastic systems, SIAM J. Control Optim., 41 (2003), pp. 1626–1659.
[6] E. I. Gordienko, Adaptive strategies for certain classes of controlled Markov processes, Theory Probab. Appl., 29 (1985), pp. 504–518.
[7] E. I. Gordienko and J. A. Minjárez-Sosa, Adaptive control for discrete-time Markov processes with unbounded costs: Discounted criterion, Kybernetika, 34 (1998), pp. 217–234.
[8] O. Hernández-Lerma, Adaptive Markov Control Processes, Springer-Verlag, New York, 1989.
[9] O. Hernández-Lerma and R. Cavazos-Cadena, Density estimation and adaptive control of Markov processes: Average and discounted criteria, Acta Appl. Math., 20 (1990), pp. 285–307.
[10] O. Hernández-Lerma and J. B. Lasserre, Further Topics on Discrete-Time Markov Control Processes, Springer-Verlag, New York, 1999.
[11] N. Hilgert and J. A. Minjárez-Sosa, Adaptive control of stochastic systems with unknown disturbance distribution: Discounted criteria, Math. Methods Oper. Res., 63 (2006), pp. 443–460.
[12] A. Jaśkiewicz and A. S. Nowak, Zero-sum ergodic stochastic games with Feller transition probabilities, SIAM J. Control Optim., 45 (2006), pp. 773–789.
[13] H.-U. Kuenle, On Markov games with average reward criterion and weakly continuous transition probabilities, SIAM J. Control Optim., 45 (2007), pp. 2156–2168.
[14] J. D. C. Little, The use of storage water in a hydroelectric system, Oper. Res., 3 (1955), pp. 187–197.
[15] F. Luque-Vásquez, Zero-sum semi-Markov games in Borel spaces: Discounted and average payoff, Bol. Soc. Mat. Mexicana, 8 (2002), pp. 227–241.
[16] J. A. Minjárez-Sosa, Nonparametric adaptive control for discrete-time Markov processes with unbounded costs under average criterion, Appl. Math. (Warsaw), 26 (1999), pp. 267–280.
[17] J. A. Minjárez-Sosa, Approximation and estimation in Markov control processes under a discounted criterion, Kybernetika, 40 (2004), pp. 681–690.
[18] K. Najim, A. S. Poznyak, and E. Gómez, Adaptive policy for two finite Markov chains zero-sum stochastic game with unknown transition matrices and average payoffs, Automatica J. IFAC, 37 (2001), pp. 1007–1018.
[19] G. P. Papavassilopoulos, Adaptive games, in Stochastic Processes in Physics and Engineering, S. Albeverio, P. Blanchard, M. Hazewinkel, and L. Streit, eds., Reidel, Dordrecht, The Netherlands, 1988, pp. 223–236.
[20] M. Schäl, Estimation and control in discounted stochastic dynamic programming, Stochastics, 20 (1987), pp. 51–71.
[21] S. Simonovic, The implicit stochastic model for reservoir yield optimization, Water Resour. Res., 23 (1987), pp. 2159–2165.
[22] M. J. Sobel, Reservoir management models, Water Resour. Res., 11 (1975), pp. 767–776.
[23] J. A. E. E. Van Nunen and J. Wessels, A note on dynamic programming with unbounded rewards, Management Sci., 24 (1977/78), pp. 576–580.
[24] S. Yakowitz, Dynamic programming applications in water resources, Water Resour. Res., 18 (1982), pp. 673–696.
[25] W. W-G. Yeh, Reservoir management and operations models: A state-of-the-art review, Water Resour. Res., 21 (1985), pp. 1797–1818.
