Reinforcement Learning in Partially Observable Multiagent Settings:
Monte Carlo Exploring Policies
Presenter: Roi Ceren, THINC Lab, University of Georgia
Prashant Doshi, THINC Lab, University of Georgia
Bikramjit Banerjee, University of Southern Mississippi, [email protected]
Introduction
Model-free reinforcement learning in multiagent systems is a nascent field
Monte Carlo Exploring Starts for POMDPs (MCES-P) is a powerful single-agent RL technique
Policy iteration leveraging Q-learning to hill-climb through the local policy space to a local optimum
Allows PAC bounds to select sample complexity with confidence
Introduction
We extend MCES-P to the non-cooperative multiagent setting and introduce MCES for Interactive POMDPs (MCES-IP)
Explicitly models the opponent
Predicates action-values on expected opponent behavior
When instantiated with PAC, trades off the computational expense of modeling against a lower sample complexity bound
We additionally provide a policy space pruning mechanism to promote scalability
Parametrically bounds the regret from avoided policies
Prioritizes eliminating low-regret policy transformations
Background: Multiagent Decision Process
In the multiagent setting, all agents affect the state and the reward for each agent
[Figure: agents i and j each choose actions that affect the shared physical state; each receives an individual reward from the joint reward function R(s, a_i, a_j)]
Background: I-POMDP
The Interactive POMDP (I-POMDP) (Gmytrasiewicz and Doshi 2005): ⟨IS, A, T, Ω, O, R⟩
Non-cooperative: agents get individual, potentially competitive rewards
Actions A, state transitions T, observations Ω, observation probabilities O, and rewards R
IS: interactive state, combining the physical state and a model of the other agent
Significant uncertainty: the agent must reason not only about the physical state, but also about the opponent's motivations and beliefs
Background: MCES-P Template
Monte Carlo Exploring Starts for POMDPs (MCES-P) (Perkins, AAAI 2002)
General template (a minimal sketch follows below):
Explore the neighborhood of π: all policies that differ by a single action a on some observation sequence o
Compute expected values by simulating policies online
Hill-climb to policies with better values
Terminate if no neighbor is better than the current policy
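The template above fits in a few lines of Python. This is a minimal sketch, not the paper's implementation: `simulate` (one Monte Carlo rollout of a policy) and `neighborhood` (the single-transformation neighbors) are hypothetical helpers, and Q-values are shown as plain sample means rather than the mixing-rate update introduced later.

```python
def mces_p(pi, k, epsilon, simulate, neighborhood):
    """Hill-climb through policy space until no neighbor is better.

    pi           -- current policy (observation sequence -> action)
    k            -- Monte Carlo samples drawn per policy
    epsilon      -- improvement threshold for transforming
    simulate     -- hypothetical: returns one sampled return for a policy
    neighborhood -- hypothetical: yields all single-action transformations
    """
    while True:
        q_pi = sum(simulate(pi) for _ in range(k)) / k
        best, best_q = None, q_pi
        for pi_prime in neighborhood(pi):
            q_prime = sum(simulate(pi_prime) for _ in range(k)) / k
            # transform only if the neighbor beats pi by more than epsilon
            if q_prime > q_pi + epsilon and q_prime > best_q:
                best, best_q = pi_prime, q_prime
        if best is None:
            return pi  # epsilon-local optimum: no neighbor is better
        pi = best      # hill-climb to the better neighbor
```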
Background: MCES-P Template
Transformation
Pick a random observation sequence and replace its action with a random alternative
[Figure: a depth-2 policy tree before and after the transformation {o1,o2}: a1 → a3, which changes the action taken after observing o1 then o2]
Background: MCES-P Template
Transformation
Pick a random observation sequence and replace its action with a random alternative
[Figure: π and its neighbors π′, one per transformation, e.g. {o1,o2}: a1 ↔ a3, {o1,o2}: a1 ↔ a2, o1: a1 ↔ a3, o1: a1 ↔ a2, ∅: a2 ↔ a1, ∅: a2 ↔ a3]
Background: MCES-P Template
Transformation
Together, these transformations define the local neighborhood of π (enumerated in the sketch below)
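A small sketch of how that local neighborhood can be enumerated, assuming policies are represented as dictionaries from observation-sequence tuples to actions (a representation chosen here for illustration, not taken from the talk):

```python
def neighborhood(pi, actions):
    """Yield every policy differing from pi by one action on one
    observation sequence -- the local neighborhood pictured above."""
    for o_seq, a in pi.items():
        for a_new in actions:
            if a_new != a:
                pi_prime = dict(pi)      # copy, then apply one transformation
                pi_prime[o_seq] = a_new  # e.g. {('o1','o2'): 'a1' -> 'a3'}
                yield pi_prime

# Depth-2 policy over observations o1, o2 and actions a1..a3
actions = ['a1', 'a2', 'a3']
pi = {(): 'a2', ('o1',): 'a1', ('o2',): 'a3',
      ('o1', 'o1'): 'a3', ('o1', 'o2'): 'a1',
      ('o2', 'o1'): 'a2', ('o2', 'o2'): 'a2'}
print(sum(1 for _ in neighborhood(pi, actions)))  # 7 sequences x 2 = 14
```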
Background: MCES-P Template
Sampling
Pick a random action and simulate, then update the action-value with learning rate α:
$Q_{\pi \leftarrow (o,a)} \leftarrow (1 - \alpha(m, c_{o,a}))\, Q_{\pi \leftarrow (o,a)} + \alpha(m, c_{o,a}) \cdot R_{\text{post-}o}(\tau)$
Background: MCES-P Template
Sampling
Sample the neighborhood k times for each policy; transform from π to a neighbor π′ when
$Q_{\pi'} > Q_{\pi} + \epsilon$
Background: MCES-P Template
Termination
Terminate when all neighbors have been sampled k times and no neighbor is better
Background: MCESP+PAC
Problem: choosing a good sample bound k
Low values of k increase the chance that we make errors when transforming
High values, though requiring more samples, guarantee that we hill-climb correctly
[Figure: a spectrum from inaccurate Q-values with high error probability (low k) to accurate Q-values with low error probability (high k)]
Background: MCESP+PAC
Solution: pick a k that guarantees some confidence in the accuracy of the Q-value
Probably Approximately Correct (PAC) learning: the probability of the sample average deviating from the true mean by more than 𝜖 is bounded by the error 𝛿
$\Pr\left(\left|\bar{X} - \mu\right| > \epsilon\right) \le 2 \cdot \exp\left(-2k(\epsilon/\Lambda)^2\right) = \delta$
Background: MCESP+PAC
With 𝜖 and 𝛿, we calculate the number of samples required to satisfy the error bound
𝑚 is the number of transformations so far; 𝑁 is the number of neighbor policies
$\delta_m = \frac{6\delta}{\pi^2 m^2}$
$k_m = \left\lceil 2\left(\frac{\Lambda(\pi)}{\epsilon}\right)^2 \ln\frac{2N}{\delta_m} \right\rceil$
$\Lambda(\pi', \pi) \triangleq \max_{\tau}\left(Q_{\pi} - Q_{\pi'}\right) - \min_{\tau}\left(Q_{\pi} - Q_{\pi'}\right) \le 2T\left(R_{\max} - R_{\min}\right)$
$\Lambda(\pi) = \max_{\pi' \in \text{Neighbors}(\pi)} \Lambda(\pi, \pi')$
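Putting those bounds into numbers: a small sketch of the sample-bound computation, assuming the $\delta_m$ allocation and $k_m$ formula as reconstructed above; the constants in the example call are invented for illustration.

```python
import math

def delta_m(m, delta):
    """Per-transformation error allocation; since sum(6/(pi^2 m^2)) = 1,
    the total error over all transformations stays within delta."""
    return 6 * delta / (math.pi ** 2 * m ** 2)

def k_m(m, epsilon, delta, Lam, N):
    """Samples needed for the m-th transformation with N neighbors."""
    return math.ceil(2 * (Lam / epsilon) ** 2
                     * math.log(2 * N / delta_m(m, delta)))

# Illustration: Lambda <= 2*T*(R_max - R_min); values here are invented
print(k_m(m=1, epsilon=0.5, delta=0.1, Lam=10, N=14))
```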
Background: MCESP+PAC
We can transform early by modifying 𝜖
Terminate when $k_m$ samples of each neighbor have been taken, or when for all neighbor policies:
$Q_{o,a} < Q_{o,\pi(o)} + \epsilon - \epsilon\left(m, c_{o,a}, c_{o,\pi(o)}\right)$
where
$\epsilon(m, p, q) = \begin{cases} \Lambda(\pi, \pi')\sqrt{\frac{1}{2p}\ln\frac{2(k_m - 1)N}{\delta_m}} & \text{if } p = q < k_m \\ \epsilon/2 & \text{if } p = q = k_m \\ \infty & \text{otherwise} \end{cases}$
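The case analysis above translates directly into code. A minimal sketch of the early-transformation threshold, under the reconstruction given here (p and q are the sample counts of the two policies being compared):

```python
import math

def eps_schedule(m, p, q, k_m, Lam, N, delta_m, epsilon):
    """epsilon(m, p, q): tighter thresholds as more samples accrue."""
    if p == q < k_m:
        # confidence radius after p matched samples of both policies
        return Lam * math.sqrt(math.log(2 * (k_m - 1) * N / delta_m)
                               / (2 * p))
    if p == q == k_m:
        return epsilon / 2   # final comparison at the full sample bound
    return math.inf          # counts not matched: no early decision
```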
Background: MCESP+PAC
Then, with probability 1 − 𝛿:
1. MCESP+PAC picks transformations that are always better than the current policy
2. MCESP+PAC terminates with a policy that is an 𝜖-local optimum
That is, no neighbor is better than the last policy by more than 𝜖
MCES-P for Multiagent Settings
MCES-P can almost be used as-is in the multiagent setting
However, MCES-P has high computational costs: a large neighborhood requiring $k_m$ samples each
MCES for I-POMDPs (MCES-IP): explicitly models the opponent and significantly decreases sample requirements
Observations come in two kinds:
Public: noisily indicates the physical state
Private: noisily indicates the other agents' actions
MCES-IP Template
MCES-P vs. MCES-IP
MCES-P simulation and Q-update: pick random 𝑜 and 𝑎; simulate π ← (𝑜, 𝑎), generating τ; update $Q_{\pi \leftarrow (o,a)}$ with $R_{\text{post-}o}(\tau)$
MCES-IP reasons about which actions the opponent took in the simulation prior to updating: pick random 𝑜 and 𝑎; simulate π ← (𝑜, 𝑎), generating τ; update the belief over opponent models; calculate $a_j$ from the most likely models; update $Q^{a_j}_{\pi \leftarrow (o,a)}$ with $R_{\text{post-}o}(\tau)$ (see the episode sketch below)
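A hedged sketch of one MCES-IP simulation episode, showing where the extra opponent-modeling steps sit relative to MCES-P. Here `step`, `update_belief`, and the model objects' `predict` method are hypothetical stand-ins for the talk's simulator and Bayesian model update:

```python
def mces_ip_episode(pi, models, step, update_belief, T):
    """Simulate T rounds; return the inferred opponent action sequence
    a_j (which indexes the Q-update) and the cumulative reward."""
    a_j, total_reward, history = [], 0.0, []
    for t in range(T):
        public_obs, private_obs, r = step(pi, history)  # one round
        total_reward += r
        # private observations noisily reveal the opponent's action,
        # so use them to update the belief over opponent models
        models = update_belief(models, public_obs, private_obs)
        best = max(models, key=models.get)   # most probable model
        a_j.append(best.predict(t))          # its action this round
        history.append(public_obs)
    return tuple(a_j), total_reward
```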
MCES-IP Template
Models
MCES-IP maintains a set of models of the opponent, where a model = ⟨history, policy tree⟩
[Figure: three candidate opponent models m1, m2, m3, each a depth-2 policy tree over observations o1, o2]
MCES-IP Template
Generating $a_j$
Every round, MCES-IP updates its belief over the models and selects the most probable action of the most likely model
[Figure: belief over models m1, m2, m3 at rounds t = 1, 2, 3, updated from each round's public and private observations; the most likely model yields $a_j^0 = 2$, $a_j^1 = 1$, $a_j^2 = 3$, so $a_j = \{2, 1, 3\}$]
MCES-IP Template
Updating Q-values
Update counts and Q-values using $a_j$
So far, MCES-IP is more expensive than MCES-P: the Q-table is now up to $|A_j|^T$ times larger!
$Q^{a_j}_{\pi \leftarrow (o,a)} \leftarrow \left(1 - \alpha\left(m, c^{a_j}_{o,a}\right)\right) Q^{a_j}_{\pi \leftarrow (o,a)} + \alpha\left(m, c^{a_j}_{o,a}\right) \cdot R_{\text{post-}o}(\tau)$
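A minimal sketch of the $a_j$-indexed update, assuming a simple count-based learning-rate schedule (the schedule is an illustrative choice, not taken from the talk):

```python
from collections import defaultdict

Q = defaultdict(float)  # keyed by (o_seq, a, a_j): the extra a_j index
c = defaultdict(int)    # is what inflates the table by up to |A_j|^T

def update_q(o_seq, a, a_j, ret, alpha=lambda n: 1.0 / n):
    """Mixing-rate update of Q^{a_j}_{pi<-(o,a)} with return R_post-o(tau)."""
    key = (o_seq, a, a_j)
    c[key] += 1
    lr = alpha(c[key])
    Q[key] = (1 - lr) * Q[key] + lr * ret
```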
MCESIP+PAC
PAC Bounds
MCESIP+PAC has similar PAC bounds to MCESP+PAC:
$k_m = \left\lceil 2\left(\frac{\Lambda^{a_j}(\pi_i)}{\epsilon}\right)^2 \ln\frac{2N}{\delta_m} \right\rceil$
$\epsilon^{a_j}(m, p, q) = \begin{cases} \Lambda^{a_j}(\pi_i, \pi_i')\sqrt{\frac{1}{2p}\ln\frac{2(k_m - 1)N}{\delta_m}} & \text{if } p = q < k_m \\ \epsilon/2 & \text{if } p = q = k_m \\ \infty & \text{otherwise} \end{cases}$
MCESIP+PAC
PAC Bounds
$\Lambda^{a_j}$ modifies the range of possible rewards
Since the opponent action is known, the range of possible rewards may often be narrower, resulting in the following proposition:
$\Lambda^{a_j}(\pi_i, \pi_i') \le \Lambda(\pi_i, \pi_i')$
Example reward table:

         a_j1   a_j2
a_i1      0      3
a_i2      4      5

If the opponent is known to have played a_j1, the rewards range over [0, 4] (width 4) rather than the full [0, 5] (width 5)
MCESIP+PAC
PAC Bounds
MCESIP+PAC terminates when $k_m$ samples of the local neighborhood bear no better policy, or when for all neighbors π′:
$Q_{\pi'} < Q_{\pi} + \epsilon - \epsilon\left(m, c_{o,a}, c_{o,\pi(o)}\right)$
With probability 1 − 𝛿:
1. MCESIP+PAC picks transformations that are always better than the current policy
2. MCESIP+PAC terminates with a policy that is an 𝜖-local optimum
Policy Search Space Pruning
Introduction
Not all observation sequences occur with the same probability
Low-likelihood events are difficult to sample
Pruning: avoid policy transformations that involve rare observation sequences, while considering the impact on reward
Regret: the amount of expected value lost by not simulating these transformations (see the sketch below)
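A sketch of regret-based pruning. The regret formula here, Pr(o) times the reward range, is reverse-engineered from the numbers on the next slide (6% × 110 ≈ 6.6 and 30% × 110 = 33, with Tiger-style rewards in [−100, 10]); the exact parametric bound in the paper may differ.

```python
def regret(prob_o, r_max, r_min):
    """Upper bound on expected value lost by never transforming on an
    observation sequence o that occurs with probability prob_o."""
    return prob_o * (r_max - r_min)

def prune(transformations, seq_prob, r_max, r_min, allowable):
    """Keep only transformations worth exploring: those whose foregone
    regret would exceed the allowable threshold."""
    return [(o, a) for (o, a) in transformations
            if regret(seq_prob[o], r_max, r_min) > allowable]

# Tiger-style rewards in [-100, 10] give a range of 110
print(regret(0.06, 10, -100))  # ~6.6: cheap to prune
print(regret(0.30, 10, -100))  # 33.0: costly to prune
```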
Policy Search Space Pruning
Regret
[Figure: two transformations of a Tiger-problem policy tree (actions L = listen, GL = go left, GR = go right); one affected observation sequence occurs with Pr ≈ 6%, giving regret ≈ 6.6, the other with Pr ≈ 30%, giving regret ≈ 33]
Policy Search Space Pruning
[Figure: as the allowable-regret slider moves from 0% to 100%, more low-regret transformations are excluded from the allowed set]
Experiments
Domains
3 domains
[Figure: the Multiagent Tiger problem and the 3x2 UAV problem]
Experiments
Domains
Money Laundering (ML) problem
[Figure: money flows through bank, insurance, offshore, shell companies, casinos, and real estate across the three phases Placement, Layering, and Integration]
Experiments
Domain Parameters
Opponent follows a fixed strategy
Single: only one policy is ever used
Mixed (non-stationary environment): randomly selects from 2 to 3 policies on every new trajectory

                   𝜖      𝛿      % regret   horizon
Multiagent Tiger   0.05   0.1    15%        3
3x2 UAV            0.1    0.1    20%        3
Money Laundering   0.1    0.15   20%        3
Experiments
Comparative Results
[Figure: two runs comparing MCESP+PAC and MCESIP+PAC; top: mixed-strategy opponent; middle: single-strategy opponent]
Experiments
Pruning
Pruning is crucial to tractability: observed speedups of ×7.59, ×5.94, and ×8.37 across the three domains
Concluding Remarks
Model-free RL in multiagent settings, generalized from MCES-P
MCES-IP models the opponent (partially model-free) and is more sample efficient when paired with PAC bounds
Instantiated with PAC to provide 𝜖-local optimality, and with search space pruning for improved scalability
Thank you! Q & A
Related Works
Bayes-Adaptive POMDPs (Ross et al. 2007), extended to MPOMDPs (Amato and Oliehoek 2013): model-based RL
IMCQ-Alt for Dec-POMDPs (Banerjee et al. 2013)
Quasi-model-based: intermediate calculation of model parameters
Alternating: each agent must take turns
Bayes-Adaptive I-POMDPs (Ng et al. 2012): model-based RL; physical state perfectly observable
Background: Decision Processes
Decision problem: how to optimize behavior to maximize reward?
Choose the action that has the best expected outcome
[Figure: an agent with preferences selects an action and receives reward R(a)]
Background: Decision Processes
[Figure: adding a physical state: the agent's action now affects the state, and the reward R(s, a) depends on both]
Background: RL
A popular class of model-free RL methods is temporal difference learning
Example: Q-learning
𝜶: learning rate; 𝜸: discount factor
Computes action-values from a state by exploring new values and exploiting previous knowledge:
$Q(s, a; \alpha) = (1 - \alpha)\, Q(s, a) + \alpha\left(r(s, a) + \gamma \cdot \max_{a'} Q(s', a')\right)$
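For concreteness, a generic tabular Q-learning sketch with ε-greedy exploration. The `env` object with `reset`, `step`, and `actions` is an assumed interface, not part of the talk:

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.95, explore=0.1):
    """Tabular Q-learning: blend old estimates with bootstrapped targets."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # explore a new action or exploit previous knowledge
            if random.random() < explore:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            target = r + gamma * max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```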