Dipartimento di Elettronica e Informazione
Multiagent rational decision making: searching and learning for “good” strategies
Enrique Munoz de Cote
What is “good” multiagent learning?
The prescriptive non-cooperative agenda [Shoham et al. 07]
We are interested in problems where an agent must interact in open environments populated by other agents.
What's a "good" strategy in this situation?
Can the monkey find a "good" strategy, or does it need to learn one?
View: a single-agent perspective on the multiagent problem.
The answer is environment dependent.
Multiagent Reinforcement Learning Framework
• known world: solving; unknown world: learning
• Single agent: MDPs (Decision Theory, Planning)
• Multiple agents: matrix games and stochastic games
What is “good” multiagent learning?
Game theory and multiagent learning: brief backgrounds
• Game theory
  − Stochastic games
  − Solution concepts
• Multiagent learning
  − Solution concepts
  − Relation to game theory
Stochastic games (SGs)
SGs are good examples of how agents' behaviours depend on each other:
• Strategies represent the way agents behave.
• Strategies might change as a function of other agents' strategies.
Game theory mathematically captures behaviour in strategic situations.
[Figure: grid-world stochastic game with agents A and B and a $$ goal]
A computational example: the SG version of chicken [Hu & Wellman, 03]
• actions: U, D, R, L, X
• coin flip on collision
• semiwalls (pass with probability 50%)
• collision = -5; step cost = -1; goal = +100
• discount factor = 0.95
• both agents can reach the goal
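A minimal sketch of these parameters as code (all identifiers are illustrative, not from the original implementation):

```python
# Hypothetical encoding of the SG-of-chicken parameters listed above.
ACTIONS = ["U", "D", "R", "L", "X"]  # four moves plus X, presumably "stay put"

CHICKEN_SG = {
    "collision_reward": -5.0,    # a coin flip decides who passes on a collision
    "step_cost": -1.0,
    "goal_reward": 100.0,
    "discount": 0.95,
    "semiwall_pass_prob": 0.5,   # semiwalls let an agent through 50% of the time
}
```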
Strategies on the SG of chicken
Average expected reward pairs for different strategy combinations:
• (88.3, 43.7)
• (43.7, 88.3)
• (66, 66)
• (43.7, 43.7)
• (38.7, 38.7)
• (83.6, 83.6)
Equilibrium values
Average total reward in equilibrium:
Nash
• (88.3,43.7) very imbalanced, inefficient
• (43.7,88.3) very imbalanced, inefficient
• (53.6,53.6) ½ mix, still inefficient
Correlated
• ([43.7,88.3],[43.7,88.3]);
Minimax
• (43.7,43.7);
Friend
• (38.7,38.7)
Equilibria are computationally difficult to find in general.
Repeated Games
What if agents are allowed to play multiple times?
Strategies:
• Can be a function of history
• Can be randomized
Nash equilibrium still exists.
Computing strategies for repeated SGs
Complete information: solve
• Exact or approximate solutions
Incomplete information: learn
• The environment (as perceived by the agent) is not Markovian
• Convergence is not guaranteed
− Exceptions: zero-sum and team games
• Unwanted cycles and unpredictable behaviour appear
There are algorithms for solving and learning that use the same successive approximations to the Bellman equations to derive solution policies.
Learning equilibrium strategies in SGs
Multiagent RL updates are based on the Bellman equations (just as in single-agent RL):
  Q(s, a⃗) = R(s, a⃗) + γ Σ_{s'} T(s, a⃗, s') Eq{Q(s', ·)}
A value iteration (VI) algorithm solves for the optimal Q function, and finding a solution via VI depends on the operator Eq{·}. How can multiagent RL learn any of those strategies?
• Friend-Q: Eq = max{·}
• Foe-Q: Eq = maxmin{·}
• Nash-Q: Eq = Nash{·}
• CE-Q: Eq = CE{·}
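As a sketch of how the operator Eq{·} plugs into value iteration, here is a minimal two-agent VI loop. R(s, a1, a2) and T(s, a1, a2) are assumed callables (T returning (next_state, probability) pairs); only the two simple operators are shown, since the exact Foe-Q operator optimizes over mixed strategies with a linear program, and Nash-Q/CE-Q require equilibrium solvers:

```python
import numpy as np

def friend_op(Q_s):
    """Friend-Q: max over joint actions."""
    return Q_s.max()

def foe_op(Q_s):
    """Foe-Q: maximin (pure-strategy simplification; the exact operator
    solves a linear program over mixed strategies)."""
    return Q_s.min(axis=1).max()

def multiagent_vi(states, n_actions, R, T, gamma, eq_op, iters=100):
    """Successive approximation of Q(s, a) = R(s, a) + gamma * E[Eq{Q(s', .)}],
    where a = (a1, a2) is a joint action and rows index the agent's own actions."""
    Q = {s: np.zeros((n_actions, n_actions)) for s in states}
    for _ in range(iters):
        V = {s: eq_op(Q[s]) for s in states}       # apply the Eq{.} operator
        for s in states:
            for a1 in range(n_actions):
                for a2 in range(n_actions):
                    Q[s][a1, a2] = R(s, a1, a2) + gamma * sum(
                        p * V[s2] for s2, p in T(s, a1, a2))
    return Q
```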
Defining optimality
What's A's optimal strategy?
• the safest
• the one that minimizes the opponent's reward
• the one that maximizes the opponent's reward
• the socially stable one
In an open environment, an optimal strategy is arguable and may be defined by several criteria.
Defining optimality: our criteria
• Optimality: should obtain close to maximum utility against other best-response algorithms.
• Security: should guarantee a minimum lower-bound utility.
• Simplicity: should be intuitive to understand and implement.
• Adaptivity: should learn to behave optimally, and remain optimal (even if the environment changes).
Observation: Reinforcement Learning updates
Q-learning converges to a BR strategy in MDPs.
Definition [best response]. A best response function BR(·) returns the set of all strategies that are optimal against the environment's joint strategy.
Example environment: only agents with fixed strategies.
• Observation 1: a learner's BR is optimal against fixed strategies.
• Observation 2: a learner's BR can be modified by a change in the environment's fixed strategy.
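A minimal sketch of observation 1: when every other agent follows a fixed strategy, the learner faces an ordinary MDP, so the standard Q-learning update (names here are illustrative) converges to a member of BR(·):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)]

def q_update(s, a, r, s2, actions, alpha=0.1, gamma=0.95):
    # The fixed-strategy opponents are folded into the environment's
    # transition dynamics, which therefore remain Markovian.
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Observation 2 then says that changing the environment's fixed strategy changes the induced MDP, and with it the best response this update converges to.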
What is “good” multiagent learning?
Social Rewards
• Shaping rewards and intrinsic motivations
• Leader and follower strategies
• Open questions
Joint work with: Monica Babes and Michael L. Littman
Social rewards: hints from the brain
We’re smart, but evolution doesn’t trust us to plan all that far ahead.
Evolution programs us to want things likely to bring about what we need:
taste -> nutrition
pleasure -> procreation
eye contact -> care
generosity -> cooperation
Is cooperation “pleasurable”?
An fMRI study during repeated prisoner's dilemma showed that humans perceive "internal rewards" (activity in the brain's reward center): positive for mutual cooperation, negative for defection.
Social rewards: a telescoping effect that guides learners to better equilibria
Objective: change the behavior of the learner by influencing its early experience.
• Shaping rewards [Ng et al., 99]
• Intrinsic motivation [Singh et al., 04]
• Social rewards
Leader and follower reasoning [Littman and Stone, 01]
A leader strategy is able to guide a best-response learner.
• Assumption: the opponent will adapt to the leader's decisions.
A best-response learner is a follower.
• Assumption: its behaviour doesn't hurt anybody.
In the example game, A is a leader and B is a follower.
Leader strategies
Assumption: the opponent is playing a best response.
Matrix game of chicken (rows: agent A; columns: agent B):

              B: center   B: wall
A: center     -10, -10    1, -1
A: wall       -1, 1       0, 0

Leader fixed strategies:
• BR_B(wall) = center, so R_A(wall, center) = -1
• BR_B(center) = wall, so R_A(center, wall) = 1
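This best-response computation is small enough to spell out; a sketch using the chicken payoffs from the table above (identifiers are illustrative):

```python
# Payoffs from the matrix game of chicken: key = (A's action, B's action).
R_A = {("center", "center"): -10, ("center", "wall"): 1,
       ("wall",   "center"):  -1, ("wall",   "wall"): 0}
R_B = {("center", "center"): -10, ("center", "wall"): -1,
       ("wall",   "center"):   1, ("wall",   "wall"): 0}
MOVES = ["center", "wall"]

def follower_br(a_action):
    """B's best response to a fixed (leader) action of A."""
    return max(MOVES, key=lambda b: R_B[(a_action, b)])

# The leader commits to the fixed action whose induced best response pays A most:
# BR_B(wall) = center -> R_A = -1; BR_B(center) = wall -> R_A = +1.
leader_action = max(MOVES, key=lambda a: R_A[(a, follower_br(a))])
assert leader_action == "center"
```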
Leader mutual advantage strategies
[Figure: the SG version of the prisoner's dilemma [Munoz de Cote and Littman, 2008], marking the one-shot Nash outcome and the mutual advantage Nash of the repeated game]
Easy-to-state way: compute the convex hull.
Easy-to-compute way:
• Compute attack and defence strategies.
• Compute a mutual advantage strategy.
• Use the attack strategy as a threat against deviations (a sketch follows).
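One way to read "use the attack strategy as a threat" is a trigger strategy: follow the mutual-advantage policy while the opponent stays on the agreed path, and switch permanently to the attack policy after a deviation. A minimal sketch, assuming the three policies are precomputed callables from states to actions:

```python
class TriggerStrategy:
    """Cooperate along the mutual-advantage path; punish deviations forever."""

    def __init__(self, mutual_advantage, attack, opponent_on_path):
        self.cooperate = mutual_advantage         # state -> own action
        self.attack = attack                      # state -> own action
        self.opponent_on_path = opponent_on_path  # state -> opponent's agreed action
        self.triggered = False

    def act(self, prev_state, prev_opponent_action, state):
        if not self.triggered and prev_state is not None:
            if prev_opponent_action != self.opponent_on_path(prev_state):
                self.triggered = True             # deviation detected: punish
        return (self.attack if self.triggered else self.cooperate)(state)
```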
How can a learner also be a leader?
We influence the best-response learner's early experience with special shaping rewards called "social rewards":
• The learner starts as a leader.
• If the opponent is not a BR follower, the social shaping is washed away.
Shaping Based on Potentials
Idea: each state is assigned a potential Φ(s) [Ng et al., 1999]. On each transition from s to s', the utility is augmented with the difference in potential:
  F(s, s') = γΦ(s') − Φ(s)
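A one-function sketch of the shaped reward (phi is the potential function):

```python
def shaped_reward(r, s, s2, phi, gamma=0.95):
    """Potential-based shaping [Ng et al., 1999]: augment the environment
    reward with F(s, s') = gamma * phi(s') - phi(s)."""
    return r + gamma * phi(s2) - phi(s)
```

Because F telescopes along any trajectory, it biases early learning without changing which policies are optimal.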
The Q+shaping algorithm
Compute attack and defence strategies.
Compute a mutual advantage strategy:
• for repeated matrix games, use the [Littman and Stone, 2003] algorithm;
• for repeated stochastic games, use the [Munoz de Cote and Littman, 2008] algorithm.
Compute the state values (potentials) for the mutual advantage strategy.
Initialize the Q-table with the potential-based function F(s, s').
• The attack strategy, used as a threat against deviations, will teach BR learners better mutual advantage strategies.
Theorem [Wiewiora 03]: shaping based on potentials has the same effect as initializing the Q function with the potential values (see the sketch below).
Q+shaping's main objective is to lead or follow, as appropriate.
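Per the theorem, the shaping step can be implemented as an initialization rather than an online bonus; a sketch (names illustrative):

```python
def init_q_with_potentials(states, actions, phi):
    """[Wiewiora 03]: initializing Q(s, a) = phi(s) for all actions has the
    same effect on a Q-learner as potential-based shaping with phi. Here phi
    would hold the state values of the mutual advantage strategy."""
    return {(s, a): phi(s) for s in states for a in actions}
```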
What is “good” multiagent learning?
A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games
Joint work with: Michael L. Littman
Main Result
Concretely, we address the following computational problem: given a repeated stochastic game, return, in polynomial time, a strategy profile that is a Nash equilibrium of the average-payoff repeated stochastic game, specifically one whose payoffs match the egalitarian point.
[Figure: convex hull of the average payoffs in the (v1, v2) plane, with the egalitarian line]
How? (the short story version)
• Compute minimax (security) strategies by solving two linear programming problems.
• Then search for the point P on the convex hull with the highest egalitarian value.
[Figure: convex hull of a hypothetical SG in the (v1, v2) plane; the egalitarian line passes through the point P with the highest egalitarian value]
How? (the search for point P)
[Figure: two convex hulls of hypothetical SGs in the (v1, v2) plane, illustrating the cases P = R and P = L relative to the egalitarian line]
folkEgal(U1, U2, ε):
• Compute R = friend1, L = friend2, and the attack1, attack2 strategies.
• Find the egalitarian point and its policy:
  − If R is left of the egalitarian line: P = R.
  − Else if L is right of the egalitarian line: P = L.
  − Else: egalSearch(R, L, T).
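A schematic sketch of the case split and the bisection in egalSearch. Here solve_weighted(w) stands in for solving the stochastic game with the scalarized objective w·v1 + (1 − w)·v2 and returning the resulting payoff pair; the handling of attack values and the ε-accuracy bookkeeping from the paper are omitted:

```python
def egal_search(solve_weighted, T):
    """Bisect on the scalarization weight for T iterations, homing in on the
    egalitarian line v1 = v2."""
    lo, hi = 0.0, 1.0
    point = None
    for _ in range(T):
        w = (lo + hi) / 2.0
        v1, v2 = solve_weighted(w)
        if v1 < v2:
            lo = w              # point is left of the line: weight player 1 more
        else:
            hi = w
        point = (v1, v2)
    return point

def folk_egal(solve_weighted, R, L, T):
    """Case split from the slide; R and L are the payoff pairs of the
    friend1 and friend2 solutions."""
    if R[0] <= R[1]:            # R already left of the egalitarian line: P = R
        return R
    if L[0] >= L[1]:            # L already right of the line: P = L
        return L
    return egal_search(solve_weighted, T)
```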
Complexity
The algorithm involves solving MDPs (polynomial time) and other steps that also take polynomial time.
• The algorithm is polynomial iff T (the number of search iterations) is bounded by a polynomial.
Result: the running time is polynomial in 1/(1 − γ) (the effective horizon for discount factor γ) and in the approximation factor 1/ε.
SG version of the PD game
[Figure: grid-world SG version of the prisoner's dilemma, with agents A and B and goals $A and $B]

Algorithm    Agent A  Agent B  Behaviour
security-VI  46.5     46.5     mutual defection
friend-VI    46       46       mutual defection
CE-VI        46.5     46.5     mutual defection
folkEgal     88.8     88.8     mutual cooperation with threat of defection
Compromise game
[Figure: grid-world compromise game, with agents A and B and goals $A and $B]

Algorithm    Agent A  Agent B  Behaviour
security-VI  0        0        attacker blocking goal
friend-VI    -20      -20      mutual defection
CE-VI        68.2     70.1     suboptimal waiting strategy
folkEgal     78.7     78.7     mutual cooperation (w = 0.5) with threat of defection
Asymmetric game
[Figure: asymmetric grid-world game, with agents A and B and goals $A and $B]

Algorithm    Agent A  Agent B  Behaviour
security-VI  0        0        attacker blocking goal
friend-VI    -200     -200     mutual defection
CE-VI        32.1     32.1     suboptimal mutual cooperation
folkEgal     37.2     37.2     mutual cooperation with threat of defection
Thanks for your attention!