Online Knowledge Enhancements for Monte Carlo Tree Search in Probabilistic Planning
Bachelor presentation
Marcel Neidinger <[email protected]>
Department of Mathematics and Computer Science, University of Basel
13 February 2017
What is Probabilistic Planning?
Solve planning tasks with probabilistic transitions. Models a Markov Decision Process given by M = ⟨V, s0, A, T, R⟩:
A set of binary variables V inducing states S = 2^V
An initial state s0 ∈ S
A set of applicable actions A
A transition model T : S × A × S → [0, 1]
A reward function R(s, a)
Monte Carlo Tree Search algorithms solve MDPs
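The MDP components above can be sketched as a small data structure. This is a minimal illustration with invented names, not PROST's actual representation:

```python
import random
from dataclasses import dataclass

@dataclass
class MDP:
    """Minimal sketch of M = <V, s0, A, T, R>; states are frozensets of true variables."""
    variables: frozenset   # binary variables V; states are subsets of V
    s0: frozenset          # initial state
    actions: list          # applicable actions A
    transition: callable   # T(s, a) -> dict mapping successor state -> probability
    reward: callable       # R(s, a) -> float

    def sample_successor(self, s, a, rng=random):
        """Sample a successor s' with probability T(s, a, s')."""
        succ = self.transition(s, a)
        states, probs = zip(*succ.items())
        return rng.choices(states, weights=probs, k=1)[0]
```

A planner interacts with this model only through `sample_successor` and `reward`, which is all MCTS needs.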
Online Knowledge Enhancements for Monte Carlo Tree Search in Probabilistic Planning 2 / 33
Monte Carlo Tree Search Algorithms
Algorithmic framework to solve MDPs. Used especially in computer Go.
[Figures: a Go board; Lee Sedol]
Sources: https://commons.wikimedia.org/wiki/File:Go_board.jpg and https://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/
Four phases - Two components
[Figure: the four MCTS phases — Selection, Expansion, Simulation, Backpropagation]
Monte Carlo Tree node
MCTS tree for an MDP M. Important information in a tree node:
A state s ∈ S
A counter N^(i)(s) for the number of visits
A counter N^(i)(s, a) for all a ∈ A, counting how often a was selected in s
A reward estimate Q^(i)(s, a) for action a in state s
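A hypothetical node structure holding these counters (a sketch, not PROST's implementation):

```python
class TreeNode:
    """MCTS tree node: visit counts and per-action reward estimates."""
    def __init__(self, state, actions):
        self.state = state
        self.visits = 0                               # N(s)
        self.action_visits = {a: 0 for a in actions}  # N(s, a)
        self.q = {a: 0.0 for a in actions}            # Q(s, a)
        self.children = {}                            # (action, successor) -> TreeNode

    def update(self, action, reward):
        """Backpropagation step: keep Q(s, a) as the running average of rewards."""
        self.visits += 1
        self.action_visits[action] += 1
        n = self.action_visits[action]
        self.q[action] += (reward - self.q[action]) / n
```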
Online Knowledge
AlphaGo used neural networks for the two policies → domain-specific knowledge.
We want domain-independent enhancements.
Overview
Tree-Policy Enhancements
All Moves as First
α-AMAF
Cutoff-AMAF
Rapid Action Value Estimation
Default-Policy Enhancements
Move-Average Sampling Technique
Conclusion
What is a Tree Policy?
Iterate through the known part of the tree and select an action given a node.
Use a Q value for a state-action pair to estimate an action's reward.
UCT
MCTS implementation first proposed in 2006
[Figure: example search tree with states s1–s5, actions m, m′, m′′; reward 10]
UCT
Reward approximation for parent node vl and child node vj:

UCT(vl, vj) = Q^(i)(sl, aj) + 2·Cp·√(2·ln N^(i)(sl) / N^(i+1)(sj))   (1)

From parent vl select the child node v* that maximises

v* = argmax over vj of UCT(vl, vj)   (2)
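Equations (1) and (2) as a self-contained sketch; variable names are illustrative:

```python
import math

def uct_score(q, n_parent, n_child, cp=1.0 / math.sqrt(2)):
    """UCT value of a child (equation 1): exploitation term Q plus exploration bonus.
    Unvisited children get an infinite score so every action is tried at least once."""
    if n_child == 0:
        return math.inf
    return q + 2 * cp * math.sqrt(2 * math.log(n_parent) / n_child)

def select_action(q_values, parent_visits, child_visits):
    """Pick the action maximising the UCT score (equation 2)."""
    return max(q_values,
               key=lambda a: uct_score(q_values[a], parent_visits, child_visits[a]))
```

With few visits the exploration bonus dominates; as counts grow, selection converges towards the action with the best reward estimate.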
All Moves as First - Idea
The UCT score needs several trials to become reliable.
Idea: generalize the information extracted from trials.
Implementation: use an additional (node-independent) score that updates unselected actions as well.
[Figure: example search tree with states s1–s5, actions m, m′, m′′; reward 10, plus a table of (state, action, reward) entries, e.g. (s1, m, …)]
All Moves as First - α-AMAF
Idea: Combine UCT and AMAF score
SCR = α · AMAF + (1 − α) · UCT   (3)
Choose action with highest SCR
All Moves as First - α-AMAF - Results
[Plot: IPPC score per domain (wildfire, triangle, academic, elevators, tamarisk, sysadmin, recon, game, traffic, crossing, skill, navigation, total) for AMAF with α ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0}]
All Moves as First - α-AMAF - Problems
With more trials the UCT score becomes more reliable, while the AMAF score has higher variance.
→ We want to stop using the AMAF score after some time.
All Moves as First - Cutoff-AMAF
Introduce a cutoff parameter K:

SCR = α · AMAF + (1 − α) · UCT, for i ≤ K
SCR = UCT, else   (4)

Use the AMAF score only in the first K trials.
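Equations (3) and (4) combined as a minimal sketch, with the trial counter and scores passed in as plain values:

```python
def cutoff_amaf_score(amaf, uct, alpha, trial, cutoff_k):
    """Blend AMAF into UCT (equation 3), but only for the first K trials (equation 4)."""
    if trial <= cutoff_k:
        return alpha * amaf + (1 - alpha) * uct
    return uct
```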
All Moves as First - Cutoff-AMAF - Results
[Plot: total IPPC score (roughly 0.5–0.75) vs. K value (0–50) for Cutoff-AMAF with init: IDS, backup: MC, compared against raw UCT and plain α-AMAF]
All Moves as First - Cutoff-AMAF - Problems
How to choose the parameter K?
When is the UCT score reliable enough?
Rapid Action Value Estimation - Idea
First introduced in 2007 for computer Go. Use a soft cutoff:

α = max{0, (V − v(n)) / V}   (5)

Use the UCT score for often-visited nodes and the AMAF score for less-visited ones.
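The soft cutoff of equation (5) as a sketch: the AMAF weight decays linearly with the node's visit count and vanishes once v(n) reaches the parameter V:

```python
def rave_alpha(node_visits, v_param):
    """Equation (5): weight decays linearly from 1 to 0 as the visit count grows."""
    return max(0.0, (v_param - node_visits) / v_param)

def rave_score(amaf, uct, node_visits, v_param):
    """Blend as in alpha-AMAF, but with the visit-dependent weight."""
    a = rave_alpha(node_visits, v_param)
    return a * amaf + (1 - a) * uct
```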
Rapid Action Value Estimation - Results
[Plot: IPPC score per domain for UCT vs. RAVE(5), RAVE(15), RAVE(25), RAVE(50)]
All Moves as First - Conclusion
[Plot: IPPC score per domain for UCT, RAVE(25), and AMAF(α = 0.2)]
Rapid Action Value Estimation - Problems
PROST uses a problem description with conditional effects; also, no preconditions are given. The PROST description is more general.
[Figure: grid game with player, goal field, and move path]
In PROST: Action: move_up
In e.g. computer chess: Action: move_a2_to_a3
Predicate Rapid Action Value Estimation
A state has predicates that give some context.
Idea: use predicates to find similar states and use their score.
Q_PRAVE(s, a) = (1/N) · Σ over p ∈ P of Q_RAVE(p, a)   (6)

and weight with

α = max{0, (V − v(n)) / V}   (7)
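Equation (6) as a sketch: average the per-predicate RAVE estimates of a state for a given action. The predicate extraction and the Q_RAVE table are assumed here, with illustrative names:

```python
def prave_q(predicates, action, q_rave):
    """Equation (6): average the per-predicate RAVE estimates for an action.
    q_rave maps (predicate, action) -> score; unseen pairs default to 0."""
    if not predicates:
        return 0.0
    total = sum(q_rave.get((p, action), 0.0) for p in predicates)
    return total / len(predicates)
```

States sharing predicates thus share score information, even if they were never visited themselves.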
All Moves as First - Conclusion - Revisited
[Plot: IPPC score per domain for UCT, PRAVE, RAVE(25), and AMAF(α = 0.2)]
Overview
Tree-Policy Enhancements
All Moves as First
α-AMAF
Cutoff-AMAF
Rapid Action Value Estimation
Default-Policy Enhancements
Move-Average Sampling Technique
Conclusion
What is a Default Policy?
[Figure: MCTS phases with the simulation step highlighted]

Simulate the outcome of a trial.
Basic default policy: random walk.
X-Average Sampling Technique
Use tree knowledge to bias default policy towards moves thatare more goal-oriented
Move-Average Sampling Technique - Idea - Sample Game
[Figure: grid game with player, goal field, and move path]

Introduce Q(a). Use moves that are good on average. Choose an action according to:

P(a) = e^(Q(a)/τ) / Σ over b ∈ A of e^(Q(b)/τ)   (8)
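The Gibbs/Boltzmann distribution of equation (8) as a sketch; the max-shift is a standard numerical-stability trick that leaves the distribution unchanged:

```python
import math
import random

def mast_probabilities(q, tau=1.0):
    """Equation (8): Gibbs/Boltzmann distribution over actions from average rewards."""
    m = max(q.values())  # shift by max(Q) to avoid overflow in exp
    weights = {a: math.exp((v - m) / tau) for a, v in q.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

def mast_sample(q, tau=1.0, rng=random):
    """Sample a default-policy action from the MAST distribution."""
    probs = mast_probabilities(q, tau)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]
```

Lower temperatures τ concentrate the distribution on the best-scoring move; higher ones approach the uniform random walk.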
Move-Average Sampling Technique - Idea - Example
Actions: r, r, u, u, u
Q(r) = 1; N(r) = 2
Q(u) = 6; N(u) = 3

Actions: r, r, u, l, l
Q(r) = 2; N(r) = 4
Q(u) = 7; N(u) = 4
Q(l) = 3; N(l) = 2
Move-Average Sampling Technique - Idea - Example (2)
Actions: l, u, u, r, r
Q(r) = 7; N(r) = 6
Q(u) = 8; N(u) = 6
Q(l) = 2; N(l) = 3

Actions: r, r, r, u, u
Q(r) = 7; N(r) = 9
Q(u) = 9; N(u) = 8
Q(l) = 2; N(l) = 3
Move-Average Sampling Technique - Results
[Plot: IPPC score per domain for UCT(RandomWalk) vs. UCT(MAST); scores roughly in the 0–0.5 range]
Overview
Tree-Policy Enhancements
All Moves as First
α-AMAF
Cutoff-AMAF
Rapid Action Value Estimation
Default-Policy Enhancements
Move-Average Sampling Technique
Conclusion
Conclusion
Tree-policy enhancements:
α-AMAF and RAVE perform worse than standard UCT.
PRAVE performs slightly better but still worse than standard UCT.
Default-policy enhancements:
MAST outperforms the random walk.