Dipartimento di Elettronica e Informazione
Multiagent rational decision making: searching and learning for “good” strategies
Enrique Munoz de Cote
What is “good” multiagent learning?
The prescriptive non-cooperative agenda [Shoham et al. 07]
We are interested in problems where an agent must interact in open environments populated by other agents.
What's a "good" strategy in this situation?
Can the monkey find a "good" strategy, or does it need to learn one?
View: a single-agent perspective on the multiagent problem.
The answer is environment dependent.
Multiagent Reinforcement Learning Framework
• known world: solving; unknown world: learning
• Single agent: MDPs (Decision Theory, Planning)
• Multiple agents: matrix games and stochastic games
What is “good” multiagent learning?
Game theory and multiagent learning: brief backgrounds
• Game theory
  − Stochastic games
  − Solution concepts
• Multiagent learning
  − Solution concepts
  − Relation to game theory
Stochastic games (SGs)
SGs are good examples of how agents' behaviours depend on each other:
• Strategies represent the way agents behave.
• Strategies might change as a function of other agents' strategies.
Game theory mathematically captures behaviour in strategic situations.
[Figure: grid-world stochastic game with agents A and B and a $$ goal]
A computational example: the SG version of chicken [Hu & Wellman, 03]
• actions: U, D, R, L, X
• coin flip on collision
• semiwalls (pass with probability 50%)
• collision = -5; step cost = -1; goal = +100
• discount factor = 0.95
• both agents can reach the goal
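A minimal sketch of these parameters as code (all identifiers are illustrative, not from the original implementation):

```python
# Hypothetical encoding of the SG-of-chicken parameters listed above.
ACTIONS = ["U", "D", "R", "L", "X"]  # four moves plus X, presumably "stay put"

CHICKEN_SG = {
    "collision_reward": -5.0,    # a coin flip decides who passes on a collision
    "step_cost": -1.0,
    "goal_reward": 100.0,
    "discount": 0.95,
    "semiwall_pass_prob": 0.5,   # semiwalls let an agent through 50% of the time
}
```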
Strategies on the SG of chicken
Average expected reward pairs for different strategy combinations:
• (88.3, 43.7)
• (43.7, 88.3)
• (66, 66)
• (43.7, 43.7)
• (38.7, 38.7)
• (83.6, 83.6)
Equilibrium values
Average total reward in equilibrium:
Nash
• (88.3,43.7) very imbalanced, inefficient
• (43.7,88.3) very imbalanced, inefficient
• (53.6,53.6) ½ mix, still inefficient
Correlated
• ([43.7,88.3],[43.7,88.3]);
Minimax
• (43.7,43.7);
Friend
• (38.7,38.7)
Equilibria are computationally difficult to find in general.
Repeated Games
What if agents are allowed to play multiple times?
Strategies:
• Can be a function of history
• Can be randomized
Nash equilibrium still exists.
Computing strategies for repeated SGs
Complete information: solve
• Exact or approximate solutions
Incomplete information: learn
• The environment (as perceived by the agent) is not Markovian
• Convergence is not guaranteed
− Exceptions: zero-sum and team games
• Unwanted cycles and unpredictable behaviour appear
There are algorithms for solving and learning that use the same successive approximations to the Bellman equations to derive solution policies.
Learning equilibrium strategies in SGs
Multiagent RL updates are based on the Bellman equations (just as in single-agent RL):
  Q(s, a⃗) = R(s, a⃗) + γ Σ_{s'} T(s, a⃗, s') Eq{Q(s', ·)}
A value iteration (VI) algorithm solves for the optimal Q function, and finding a solution via VI depends on the operator Eq{·}. How can multiagent RL learn any of those strategies?
• Friend-Q: Eq = max{·}
• Foe-Q: Eq = maxmin{·}
• Nash-Q: Eq = Nash{·}
• CE-Q: Eq = CE{·}
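As a sketch of how the operator Eq{·} plugs into value iteration, here is a minimal two-agent VI loop. R(s, a1, a2) and T(s, a1, a2) are assumed callables (T returning (next_state, probability) pairs); only the two simple operators are shown, since the exact Foe-Q operator optimizes over mixed strategies with a linear program, and Nash-Q/CE-Q require equilibrium solvers:

```python
import numpy as np

def friend_op(Q_s):
    """Friend-Q: max over joint actions."""
    return Q_s.max()

def foe_op(Q_s):
    """Foe-Q: maximin (pure-strategy simplification; the exact operator
    solves a linear program over mixed strategies)."""
    return Q_s.min(axis=1).max()

def multiagent_vi(states, n_actions, R, T, gamma, eq_op, iters=100):
    """Successive approximation of Q(s, a) = R(s, a) + gamma * E[Eq{Q(s', .)}],
    where a = (a1, a2) is a joint action and rows index the agent's own actions."""
    Q = {s: np.zeros((n_actions, n_actions)) for s in states}
    for _ in range(iters):
        V = {s: eq_op(Q[s]) for s in states}       # apply the Eq{.} operator
        for s in states:
            for a1 in range(n_actions):
                for a2 in range(n_actions):
                    Q[s][a1, a2] = R(s, a1, a2) + gamma * sum(
                        p * V[s2] for s2, p in T(s, a1, a2))
    return Q
```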
Defining optimality
What's A's optimal strategy?
• the safest
• the one that minimizes the opponent's reward
• the one that maximizes the opponent's reward
• the socially stable one
In an open environment, an optimal strategy is arguable and may be defined by several criteria.
Defining optimality: our criteria
• Optimality: should obtain close to maximum utility against other best-response algorithms.
• Security: should guarantee a minimum lower-bound utility.
• Simplicity: should be intuitive to understand and implement.
• Adaptivity: should learn to behave optimally, and remain optimal (even if the environment changes).
Observation: Reinforcement Learning updates
Q-learning converges to a BR strategy in MDPs.
Definition [best response]. A best response function BR(·) returns the set of all strategies that are optimal against the environment's joint strategy.
Example environment: only agents with fixed strategies.
• Observation 1: a learner's BR is optimal against fixed strategies.
• Observation 2: a learner's BR can be modified by a change in the environment's fixed strategy.
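A minimal sketch of observation 1: when every other agent follows a fixed strategy, the learner faces an ordinary MDP, so the standard Q-learning update (names here are illustrative) converges to a member of BR(·):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)]

def q_update(s, a, r, s2, actions, alpha=0.1, gamma=0.95):
    # The fixed-strategy opponents are folded into the environment's
    # transition dynamics, which therefore remain Markovian.
    target = r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Observation 2 then says that changing the environment's fixed strategy changes the induced MDP, and with it the best response this update converges to.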
What is “good” multiagent learning?
Social Rewards
• Shaping rewards and intrinsic motivations
• Leader and follower strategies
• Open questions
Joint work with: Monica Babes and Michael L. Littman
Social rewards: hints from the brain
We’re smart, but evolution doesn’t trust us to plan all that far ahead.
Evolution programs us to want things likely to bring about what we need:
taste -> nutrition
pleasure -> procreation
eye contact -> care
generosity -> cooperation
Is cooperation “pleasurable”?
An fMRI study during repeated prisoner's dilemma showed that humans perceive "internal rewards" (activity in the brain's reward center): positive for mutual cooperation, negative for defection.
Social rewards: a telescoping effect that guides learners to better equilibria
Objective: change the behavior of the learner by influencing its early experience.
• Shaping rewards [Ng et al., 99]
• Intrinsic motivation [Singh et al., 04]
• Social rewards
Leader and follower reasoning [Littman and Stone, 01]
A leader strategy is able to guide a best-response learner.
• Assumption: the opponent will adapt to the leader's decisions.
A best-response learner is a follower.
• Assumption: its behaviour doesn't hurt anybody.
In the example game, A is a leader and B is a follower.
Leader strategies
Assumption: the opponent is playing a best response.
Matrix game of chicken (rows: agent A; columns: agent B):

              B: center   B: wall
A: center     -10, -10    1, -1
A: wall       -1, 1       0, 0

Leader fixed strategies:
• BR_B(wall) = center, so R_A(wall, center) = -1
• BR_B(center) = wall, so R_A(center, wall) = 1
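This best-response computation is small enough to spell out; a sketch using the chicken payoffs from the table above (identifiers are illustrative):

```python
# Payoffs from the matrix game of chicken: key = (A's action, B's action).
R_A = {("center", "center"): -10, ("center", "wall"): 1,
       ("wall",   "center"):  -1, ("wall",   "wall"): 0}
R_B = {("center", "center"): -10, ("center", "wall"): -1,
       ("wall",   "center"):   1, ("wall",   "wall"): 0}
MOVES = ["center", "wall"]

def follower_br(a_action):
    """B's best response to a fixed (leader) action of A."""
    return max(MOVES, key=lambda b: R_B[(a_action, b)])

# The leader commits to the fixed action whose induced best response pays A most:
# BR_B(wall) = center -> R_A = -1; BR_B(center) = wall -> R_A = +1.
leader_action = max(MOVES, key=lambda a: R_A[(a, follower_br(a))])
assert leader_action == "center"
```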
Leader mutual advantage strategies
[Figure: the SG version of the prisoner's dilemma [Munoz de Cote and Littman, 2008], marking the one-shot Nash outcome and the mutual advantage Nash of the repeated game]
Easy-to-state way: compute the convex hull.
Easy-to-compute way:
• Compute attack and defence strategies.
• Compute a mutual advantage strategy.
• Use the attack strategy as a threat against deviations (a sketch follows).
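One way to read "use the attack strategy as a threat" is a trigger strategy: follow the mutual-advantage policy while the opponent stays on the agreed path, and switch permanently to the attack policy after a deviation. A minimal sketch, assuming the three policies are precomputed callables from states to actions:

```python
class TriggerStrategy:
    """Cooperate along the mutual-advantage path; punish deviations forever."""

    def __init__(self, mutual_advantage, attack, opponent_on_path):
        self.cooperate = mutual_advantage         # state -> own action
        self.attack = attack                      # state -> own action
        self.opponent_on_path = opponent_on_path  # state -> opponent's agreed action
        self.triggered = False

    def act(self, prev_state, prev_opponent_action, state):
        if not self.triggered and prev_state is not None:
            if prev_opponent_action != self.opponent_on_path(prev_state):
                self.triggered = True             # deviation detected: punish
        return (self.attack if self.triggered else self.cooperate)(state)
```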
How can a learner also be a leader?
We influence the best-response learner's early experience with special shaping rewards called "social rewards":
• The learner starts as a leader.
• If the opponent is not a BR follower, the social shaping is washed away.
Shaping Based on Potentials
Idea: each state is assigned a potential Φ(s) [Ng et al., 1999]. On each transition from s to s', the utility is augmented with the difference in potential:
  F(s, s') = γΦ(s') − Φ(s)
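A one-function sketch of the shaped reward (phi is the potential function):

```python
def shaped_reward(r, s, s2, phi, gamma=0.95):
    """Potential-based shaping [Ng et al., 1999]: augment the environment
    reward with F(s, s') = gamma * phi(s') - phi(s)."""
    return r + gamma * phi(s2) - phi(s)
```

Because F telescopes along any trajectory, it biases early learning without changing which policies are optimal.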
The Q+shaping algorithm
Compute attack and defence strategies.
Compute a mutual advantage strategy:
• for repeated matrix games, use the [Littman and Stone, 2003] algorithm;
• for repeated stochastic games, use the [Munoz de Cote and Littman, 2008] algorithm.
Compute the state values (potentials) for the mutual advantage strategy.
Initialize the Q-table with the potential-based function F(s, s').
• The attack strategy, used as a threat against deviations, will teach BR learners better mutual advantage strategies.
Theorem [Wiewiora 03]: shaping based on potentials has the same effect as initializing the Q function with the potential values (see the sketch below).
Q+shaping's main objective is to lead or follow, as appropriate.
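Per the theorem, the shaping step can be implemented as an initialization rather than an online bonus; a sketch (names illustrative):

```python
def init_q_with_potentials(states, actions, phi):
    """[Wiewiora 03]: initializing Q(s, a) = phi(s) for all actions has the
    same effect on a Q-learner as potential-based shaping with phi. Here phi
    would hold the state values of the mutual advantage strategy."""
    return {(s, a): phi(s) for s in states for a in actions}
```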
What is “good” multiagent learning?
A Polynomial-time Nash Equilibrium Algorithm for Repeated Stochastic Games
Joint work with: Michael L. Littman
Main Result
Concretely, we address the following computational problem: given a repeated stochastic game, return, in polynomial time, a strategy profile that is a Nash equilibrium of the average-payoff repeated stochastic game, specifically one whose payoffs match the egalitarian point.
[Figure: convex hull of the average payoffs in the (v1, v2) plane, with the egalitarian line]
How? (the short story version)
• Compute minimax (security) strategies by solving two linear programming problems.
• Then search for the point P on the convex hull with the highest egalitarian value.
[Figure: convex hull of a hypothetical SG in the (v1, v2) plane; the egalitarian line passes through the point P with the highest egalitarian value]
How? (the search for point P)
[Figure: two convex hulls of hypothetical SGs in the (v1, v2) plane, illustrating the cases P = R and P = L relative to the egalitarian line]
folkEgal(U1, U2, ε):
• Compute R = friend1, L = friend2, and the attack1, attack2 strategies.
• Find the egalitarian point and its policy:
  − If R is left of the egalitarian line: P = R.
  − Else if L is right of the egalitarian line: P = L.
  − Else: egalSearch(R, L, T).
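A schematic sketch of the case split and the bisection in egalSearch. Here solve_weighted(w) stands in for solving the stochastic game with the scalarized objective w·v1 + (1 − w)·v2 and returning the resulting payoff pair; the handling of attack values and the ε-accuracy bookkeeping from the paper are omitted:

```python
def egal_search(solve_weighted, T):
    """Bisect on the scalarization weight for T iterations, homing in on the
    egalitarian line v1 = v2."""
    lo, hi = 0.0, 1.0
    point = None
    for _ in range(T):
        w = (lo + hi) / 2.0
        v1, v2 = solve_weighted(w)
        if v1 < v2:
            lo = w              # point is left of the line: weight player 1 more
        else:
            hi = w
        point = (v1, v2)
    return point

def folk_egal(solve_weighted, R, L, T):
    """Case split from the slide; R and L are the payoff pairs of the
    friend1 and friend2 solutions."""
    if R[0] <= R[1]:            # R already left of the egalitarian line: P = R
        return R
    if L[0] >= L[1]:            # L already right of the line: P = L
        return L
    return egal_search(solve_weighted, T)
```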
Complexity
The algorithm involves solving MDPs (polynomial time) and other steps that also take polynomial time.
• The algorithm is polynomial iff T (the number of search iterations) is bounded by a polynomial.
Result: the running time is polynomial in 1/(1 − γ) (the effective horizon for discount factor γ) and in the approximation factor 1/ε.
SG version of the PD game
[Figure: grid-world SG version of the prisoner's dilemma, with agents A and B and goals $A and $B]

Algorithm    Agent A  Agent B  Behaviour
security-VI  46.5     46.5     mutual defection
friend-VI    46       46       mutual defection
CE-VI        46.5     46.5     mutual defection
folkEgal     88.8     88.8     mutual cooperation with threat of defection
Compromise game
[Figure: grid-world compromise game, with agents A and B and goals $A and $B]

Algorithm    Agent A  Agent B  Behaviour
security-VI  0        0        attacker blocking goal
friend-VI    -20      -20      mutual defection
CE-VI        68.2     70.1     suboptimal waiting strategy
folkEgal     78.7     78.7     mutual cooperation (w = 0.5) with threat of defection
Asymmetric game
[Figure: asymmetric grid-world game, with agents A and B and goals $A and $B]

Algorithm    Agent A  Agent B  Behaviour
security-VI  0        0        attacker blocking goal
friend-VI    -200     -200     mutual defection
CE-VI        32.1     32.1     suboptimal mutual cooperation
folkEgal     37.2     37.2     mutual cooperation with threat of defection
Thanks for your attention!