
Online Knowledge Enhancements for Monte Carlo Tree Search in Probabilistic Planning
Bachelor presentation

Marcel Neidinger <m.neidinger@unibas.ch>

Department of Mathematics and Computer Science,University of Basel

13 February 2017

What is Probabilistic Planning?

Solve planning tasks with probabilistic transitions
Models a Markov Decision Process (MDP) given by $M = \langle V, s_0, A, T, R \rangle$:

- A set of binary variables V, inducing the states $S = 2^V$
- An initial state $s_0 \in S$
- A set of applicable actions A
- A transition model $T : S \times A \times S \to [0, 1]$
- A reward function $R(s, a)$

Monte Carlo Tree Search algorithms solve MDPs
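As a minimal, hypothetical sketch (the names are illustrative, this is not the PROST representation), such an MDP can be held in a small Python structure:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# A state is the set of binary variables from V that are currently true.
State = FrozenSet[str]

@dataclass
class MDP:
    """Container for the tuple M = <V, s0, A, T, R> (illustrative only)."""
    variables: List[str]                               # V
    initial_state: State                               # s0 in S = 2^V
    actions: List[str]                                 # A
    transition: Callable[[State, str, State], float]   # T(s, a, s') in [0, 1]
    reward: Callable[[State, str], float]              # R(s, a)
```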


Monte Carlo Tree Search Algorithms

Algorithmic framework to solve MDPs
Used especially in computer Go

[Images: a Go board¹ and Lee Sedol²]

¹ Source: https://commons.wikimedia.org/wiki/File:Go_board.jpg
² Source: https://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/

Four phases - Two components

[Figure: the four MCTS phases: selection, expansion, simulation, and backpropagation]


Monte Carlo Tree node

MCTS tree for an MDP M

Important information in a tree node:
- A state $s \in S$
- A counter $N^{(i)}(s)$ for the number of visits
- A counter $N^{(i)}(s, a)$ for every $a \in A$, counting how often a was selected in s
- A reward estimate $Q^{(i)}(s, a)$ for action a in state s
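A possible Python shape for such a node, mirroring the counters and the estimate above (a sketch; the field names are illustrative):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TreeNode:
    """One MCTS tree node (sketch)."""
    state: frozenset                 # s in S
    visits: int = 0                  # N(s): how often this node was visited
    action_visits: Dict[str, int] = field(default_factory=dict)  # N(s, a)
    q_values: Dict[str, float] = field(default_factory=dict)     # Q(s, a)
```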


Online Knowledge

AlphaGo used neural networks for the two policies → domain-specific knowledge
We want domain-independent enhancements


Overview

Tree-Policy Enhancements
- All Moves as First
  - α-AMAF
  - Cutoff-AMAF
- Rapid Action Value Estimation

Default-Policy Enhancements
- Move-Average Sampling Technique

Conclusion


What is a Tree Policy?

Iterate through the known part of the tree and select an action given a node
Use a Q-value for a state-action pair to estimate an action's reward


UCT

MCTS implementation first proposed in 2006

[Figure: example search tree with states s1 to s5, actions m, m′, m′′, and a trial that yields reward 10]


UCT

Reward approximation, for parent node $v_l$ and child node $v_j$:

$UCT(v_l, v_j) = Q^{(i)}(s_l, a_j) + 2 C_p \sqrt{\frac{2 \ln N^{(i)}(s_l)}{N^{(i+1)}(s_j)}}$  (1)

From parent $v_l$, select the child node $v^*$ that maximises the UCT score:

$v^* = \arg\max_{v_j} \{ UCT(v_l, v_j) \}$  (2)
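A sketch of this selection rule, assuming the TreeNode fields from the earlier slide; the exploration constant $C_p$ is a tuning choice (the common value $1/\sqrt{2}$ is assumed here), and the child-visit count $N^{(i+1)}(s_j)$ is approximated by N(s, a):

```python
import math

def uct_select(node, actions, cp=0.7071):
    """Return the action maximising Equation (1) at this node (sketch)."""
    def uct(action):
        n_sa = node.action_visits.get(action, 0)
        if n_sa == 0:
            return float("inf")  # prefer actions that were never tried
        q = node.q_values.get(action, 0.0)
        return q + 2 * cp * math.sqrt(2 * math.log(node.visits) / n_sa)
    return max(actions, key=uct)
```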


All Moves as First - Idea

UCT score needs several trials to become reliable
Idea: generalize information extracted from trials
Implementation: use an additional (node-independent) score that also updates unselected actions

[Figure: the example search tree again, next to an AMAF table with columns State, Action, Reward, e.g. the row (s1, m, …)]
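A minimal sketch of such a node-independent AMAF table; the update scheme shown here, a running average over trial rewards, is an assumption:

```python
from collections import defaultdict

class AmafTable:
    """Node-independent (state, action) statistics, updated for every pair
    that occurred in a trial, not only for the pairs the tree policy selected."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.scores = defaultdict(float)

    def update(self, trial, reward):
        """trial: iterable of (state, action) pairs seen during one trial."""
        for pair in trial:
            self.counts[pair] += 1
            # incremental running average of the observed trial rewards
            self.scores[pair] += (reward - self.scores[pair]) / self.counts[pair]

    def score(self, state, action):
        return self.scores[(state, action)]
```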


All Moves as First - α-AMAF

Idea: Combine UCT and AMAF score

$SCR = \alpha \cdot AMAF + (1 - \alpha) \cdot UCT$  (3)

Choose action with highest SCR
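Equation (3) as a one-line helper (sketch):

```python
def alpha_amaf(amaf_score, uct_score, alpha):
    """Equation (3): fixed linear blend of the AMAF and UCT scores."""
    return alpha * amaf_score + (1 - alpha) * uct_score
```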


All Moves as First - α-AMAF - Results

[Figure: IPPC scores (0 to 1) per domain (wildfire, triangle, academic, elevators, tamarisk, sysadmin, recon, game, traffic, crossing, skill, navigation) and in total, for α-AMAF with α ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0}]


All Moves as First - α-AMAF - Problems

With more trials UCT becomes more reliable
The AMAF score has higher variance

We want to discontinue using the AMAF score after some time


All Moves as First - Cutoff-AMAF

Introduce cutoff parameter K

$SCR = \begin{cases} \alpha \cdot AMAF + (1 - \alpha) \cdot UCT & \text{for } i \le K \\ UCT & \text{else} \end{cases}$  (4)

Use the AMAF score only in the first K trials
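Equation (4) in the same style, with i as the current trial index (sketch):

```python
def cutoff_amaf(amaf_score, uct_score, alpha, trial_index, k):
    """Equation (4): blend the scores for the first K trials, then pure UCT."""
    if trial_index <= k:
        return alpha * amaf_score + (1 - alpha) * uct_score
    return uct_score
```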


All Moves as First - Cutoff-AMAF - Results

[Figure: total IPPC score (0.5 to 0.75) as a function of K (0 to 50) for Cutoff-AMAF (init: IDS, backup: MC), compared with raw UCT and plain α-AMAF]


All Moves as First - Cutoff-AMAF - Problems

How to choose the parameter K?
When is the UCT score reliable enough?


Rapid Action Value Estimation - Idea

First introduced in 2007 for computer Go
Use a soft cutoff:

$\alpha = \max\left\{0, \frac{V - v(n)}{V}\right\}$  (5)

Use UCT for often-visited nodes and the AMAF score for less-visited ones
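A sketch of the soft cutoff, where V is the RAVE parameter and v(n) the node's visit count:

```python
def rave_score(amaf_score, uct_score, node_visits, v):
    """Equations (3) and (5): alpha shrinks to 0 as v(n) approaches V,
    so heavily visited nodes rely on the UCT score alone (sketch)."""
    alpha = max(0.0, (v - node_visits) / v)
    return alpha * amaf_score + (1 - alpha) * uct_score
```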


Rapid Action Value Estimation - Results

[Figure: IPPC scores (0 to 1) per domain and in total for UCT, RAVE(5), RAVE(15), RAVE(25), and RAVE(50)]


All Moves as First - Conclusion

[Figure: IPPC scores (0 to 1) per domain and in total for UCT, RAVE(25), and AMAF(α = 0.2)]


Rapid Action Value Estimation - Problems

PROST uses a problem description with conditional effects
No preconditions are given
The PROST description is more general

[Figure: grid world with player, goal field, and move path]

In PROST:
Action: move_up

In, e.g., computer chess:
Action: move_a2_to_a3


Predicate Rapid Action Value Estimation

A state has predicates that give some context
Idea: use predicates to find similar states and use their score

$Q_{PRAVE}(s, a) = \frac{1}{N} \sum_{p \in P} Q_{RAVE}(p, a)$  (6)

and weight with

$\alpha = \max\left\{0, \frac{V - v(n)}{V}\right\}$  (7)
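A sketch of Equation (6), assuming N = |P| (the number of predicates true in s) and a hypothetical lookup q_rave(p, a) for the per-predicate RAVE scores:

```python
def prave_q(predicates, action, q_rave):
    """Equation (6): average the RAVE scores of the state's predicates."""
    if not predicates:
        return 0.0
    return sum(q_rave(p, action) for p in predicates) / len(predicates)
```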


All Moves as First - Conclusion - Revisited

[Figure: IPPC scores (0 to 1) per domain and in total for UCT, PRAVE, RAVE(25), and AMAF(α = 0.2)]


Overview

Tree-Policy Enhancements
- All Moves as First
  - α-AMAF
  - Cutoff-AMAF
- Rapid Action Value Estimation

Default-Policy Enhancements
- Move-Average Sampling Technique

Conclusion


What is a Default Policy?

[Figure: the simulation phase of MCTS]

Simulate the outcome of a trial
Basic default policy: random walk
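A minimal random-walk default policy, assuming hypothetical helpers applicable_actions(s) and sample_successor(s, a):

```python
import random

def random_walk(state, applicable_actions, sample_successor, horizon):
    """Basic default policy: choose uniformly random actions until the
    horizon is reached and return the visited (state, action) pairs."""
    trial = []
    for _ in range(horizon):
        action = random.choice(applicable_actions(state))
        trial.append((state, action))
        state = sample_successor(state, action)
    return trial
```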


X-Average Sampling Technique

Use tree knowledge to bias the default policy towards moves that are more goal-oriented


Move-Average Sampling Technique - Idea - Sample Game

[Figure: grid world with player, goal field, and move path]

Introduce Q(a)

Use moves that are good on average
Choose an action according to:

$P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b \in A} e^{Q(b)/\tau}}$  (8)
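A sketch of sampling from the Gibbs distribution in Equation (8); the temperature τ is a tuning parameter whose value is assumed here:

```python
import math
import random

def mast_sample(actions, q, tau=1.0):
    """Equation (8): sample an action with probability proportional to
    exp(Q(a) / tau), favouring moves that are good on average (sketch)."""
    weights = [math.exp(q.get(a, 0.0) / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```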


Move-Average Sampling Technique - Idea - Example

Actions: r, r, u, u, u
Q(r) = 1; N(r) = 2
Q(u) = 6; N(u) = 3

Actions: r, r, u, l, l
Q(r) = 2; N(r) = 4
Q(u) = 7; N(u) = 4
Q(l) = 3; N(l) = 2
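To make Equation (8) concrete with the first set of values above, and an assumed temperature of τ = 1 (the slides do not fix τ): $P(u) = e^6 / (e^1 + e^6) \approx 0.993$ and $P(r) = e^1 / (e^1 + e^6) \approx 0.007$, so the biased default policy would almost always move up from this position.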


Move-Average Sampling Technique - Idea - Example (2)

Actions: l, u, u, r, r
Q(r) = 7; N(r) = 6
Q(u) = 8; N(u) = 6
Q(l) = 2; N(l) = 3

Actions: r, r, r, u, u
Q(r) = 7; N(r) = 9
Q(u) = 9; N(u) = 8
Q(l) = 2; N(l) = 3


Move-Average Sampling Technique - Results

[Figure: IPPC scores (0 to 0.5) per domain and in total for UCT(RandomWalk) vs. UCT(MAST)]


Overview

Tree-Policy Enhancements
- All Moves as First
  - α-AMAF
  - Cutoff-AMAF
- Rapid Action Value Estimation

Default-Policy Enhancements
- Move-Average Sampling Technique

Conclusion


Conclusion

Tree-policy enhancements
- α-AMAF and RAVE perform worse than standard UCT
- PRAVE performs slightly better, but still worse than standard UCT

Default-policy enhancements
- MAST outperforms RandomWalk


Questions?

m.neidinger@unibas.ch