
Online Knowledge Enhancements for Monte Carlo Tree Search in Probabilistic Planning
Bachelor presentation

Marcel Neidinger <m.neidinger@unibas.ch>

Department of Mathematics and Computer Science,University of Basel

13 February 2017

What is Probabilistic Planning?

Solve planning tasks with probabilistic transitions
Models a Markov Decision Process (MDP) given by $M = \langle V, s_0, A, T, R \rangle$:

- A set of binary variables V, inducing the states $S = 2^V$
- An initial state $s_0 \in S$
- A set of applicable actions A
- A transition model $T : S \times A \times S \to [0, 1]$
- A reward function $R(s, a)$

Monte Carlo Tree Search algorithms solve MDPs
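As a minimal, hypothetical sketch (the names are illustrative, this is not the PROST representation), such an MDP can be held in a small Python structure:

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

# A state is the set of binary variables from V that are currently true.
State = FrozenSet[str]

@dataclass
class MDP:
    """Container for the tuple M = <V, s0, A, T, R> (illustrative only)."""
    variables: List[str]                               # V
    initial_state: State                               # s0 in S = 2^V
    actions: List[str]                                 # A
    transition: Callable[[State, str, State], float]   # T(s, a, s') in [0, 1]
    reward: Callable[[State, str], float]              # R(s, a)
```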


Monte Carlo Tree Search Algorithms

Algorithmic framework to solve MDPs
Used especially in computer Go

[Images: a Go board¹ and Lee Sedol²]

¹ Source: https://commons.wikimedia.org/wiki/File:Go_board.jpg
² Source: https://qz.com/639952/googles-ai-won-the-game-go-by-defying-millennia-of-basic-human-instinct/

Four phases - Two components

[Figure: the four MCTS phases: selection, expansion, simulation, and backpropagation]


Monte Carlo Tree node

MCTS tree for an MDP M

Important information in a tree node:
- A state $s \in S$
- A counter $N^{(i)}(s)$ for the number of visits
- A counter $N^{(i)}(s, a)$ for every $a \in A$, counting how often a was selected in s
- A reward estimate $Q^{(i)}(s, a)$ for action a in state s
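A possible Python shape for such a node, mirroring the counters and the estimate above (a sketch; the field names are illustrative):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TreeNode:
    """One MCTS tree node (sketch)."""
    state: frozenset                 # s in S
    visits: int = 0                  # N(s): how often this node was visited
    action_visits: Dict[str, int] = field(default_factory=dict)  # N(s, a)
    q_values: Dict[str, float] = field(default_factory=dict)     # Q(s, a)
```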


Online Knowledge

AlphaGo used neural networks for the two policies → domain-specific knowledge
We want domain-independent enhancements


Overview

Tree-Policy Enhancements
- All Moves as First
  - α-AMAF
  - Cutoff-AMAF
- Rapid Action Value Estimation

Default-Policy Enhancements
- Move-Average Sampling Technique

Conclusion


What is a Tree Policy?

Iterate through the known part of the tree and select an action given a node
Use a Q-value for a state-action pair to estimate an action's reward


UCT

MCTS implementation first proposed in 2006

[Figure: example search tree with states s1 to s5, actions m, m′, m′′, and a trial that yields reward 10]


UCT

Reward approximation, for parent node $v_l$ and child node $v_j$:

$UCT(v_l, v_j) = Q^{(i)}(s_l, a_j) + 2 C_p \sqrt{\frac{2 \ln N^{(i)}(s_l)}{N^{(i+1)}(s_j)}}$  (1)

From parent $v_l$, select the child node $v^*$ that maximises the UCT score:

$v^* = \arg\max_{v_j} \{ UCT(v_l, v_j) \}$  (2)
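A sketch of this selection rule, assuming the TreeNode fields from the earlier slide; the exploration constant $C_p$ is a tuning choice (the common value $1/\sqrt{2}$ is assumed here), and the child-visit count $N^{(i+1)}(s_j)$ is approximated by N(s, a):

```python
import math

def uct_select(node, actions, cp=0.7071):
    """Return the action maximising Equation (1) at this node (sketch)."""
    def uct(action):
        n_sa = node.action_visits.get(action, 0)
        if n_sa == 0:
            return float("inf")  # prefer actions that were never tried
        q = node.q_values.get(action, 0.0)
        return q + 2 * cp * math.sqrt(2 * math.log(node.visits) / n_sa)
    return max(actions, key=uct)
```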


All Moves as First - Idea

UCT score needs several trials to become reliable
Idea: generalize information extracted from trials
Implementation: use an additional (node-independent) score that also updates unselected actions

[Figure: the example search tree again, next to an AMAF table with columns State, Action, Reward, e.g. the row (s1, m, …)]
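A minimal sketch of such a node-independent AMAF table; the update scheme shown here, a running average over trial rewards, is an assumption:

```python
from collections import defaultdict

class AmafTable:
    """Node-independent (state, action) statistics, updated for every pair
    that occurred in a trial, not only for the pairs the tree policy selected."""
    def __init__(self):
        self.counts = defaultdict(int)
        self.scores = defaultdict(float)

    def update(self, trial, reward):
        """trial: iterable of (state, action) pairs seen during one trial."""
        for pair in trial:
            self.counts[pair] += 1
            # incremental running average of the observed trial rewards
            self.scores[pair] += (reward - self.scores[pair]) / self.counts[pair]

    def score(self, state, action):
        return self.scores[(state, action)]
```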


All Moves as First - α-AMAF

Idea: Combine UCT and AMAF score

$SCR = \alpha \cdot AMAF + (1 - \alpha) \cdot UCT$  (3)

Choose action with highest SCR
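Equation (3) as a one-line helper (sketch):

```python
def alpha_amaf(amaf_score, uct_score, alpha):
    """Equation (3): fixed linear blend of the AMAF and UCT scores."""
    return alpha * amaf_score + (1 - alpha) * uct_score
```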


All Moves as First - α-AMAF - Results

[Figure: IPPC scores (0 to 1) per domain (wildfire, triangle, academic, elevators, tamarisk, sysadmin, recon, game, traffic, crossing, skill, navigation) and in total, for α-AMAF with α ∈ {0, 0.2, 0.4, 0.6, 0.8, 1.0}]


All Moves as First - α-AMAF - Problems

With more trials UCT becomes more reliable
The AMAF score has higher variance

We want to discontinue using the AMAF score after some time


All Moves as First - Cutoff-AMAF

Introduce cutoff parameter K

$SCR = \begin{cases} \alpha \cdot AMAF + (1 - \alpha) \cdot UCT & \text{for } i \le K \\ UCT & \text{else} \end{cases}$  (4)

Use the AMAF score only in the first K trials
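Equation (4) in the same style, with i as the current trial index (sketch):

```python
def cutoff_amaf(amaf_score, uct_score, alpha, trial_index, k):
    """Equation (4): blend the scores for the first K trials, then pure UCT."""
    if trial_index <= k:
        return alpha * amaf_score + (1 - alpha) * uct_score
    return uct_score
```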


All Moves as First - Cutoff-AMAF - Results

[Figure: total IPPC score (0.5 to 0.75) as a function of K (0 to 50) for Cutoff-AMAF (init: IDS, backup: MC), compared with raw UCT and plain α-AMAF]


All Moves as First - Cutoff-AMAF - Problems

How to choose the parameter K?
When is the UCT score reliable enough?


Rapid Action Value Estimation - Idea

First introduced in 2007 for computer Go
Use a soft cutoff:

$\alpha = \max\left\{0, \frac{V - v(n)}{V}\right\}$  (5)

Use UCT for often-visited nodes and the AMAF score for less-visited ones
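A sketch of the soft cutoff, where V is the RAVE parameter and v(n) the node's visit count:

```python
def rave_score(amaf_score, uct_score, node_visits, v):
    """Equations (3) and (5): alpha shrinks to 0 as v(n) approaches V,
    so heavily visited nodes rely on the UCT score alone (sketch)."""
    alpha = max(0.0, (v - node_visits) / v)
    return alpha * amaf_score + (1 - alpha) * uct_score
```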


Rapid Action Value Estimation - Results

[Figure: IPPC scores (0 to 1) per domain and in total for UCT, RAVE(5), RAVE(15), RAVE(25), and RAVE(50)]


All Moves as First - Conclusion

[Figure: IPPC scores (0 to 1) per domain and in total for UCT, RAVE(25), and AMAF(α = 0.2)]


Rapid Action Value Estimation - Problems

PROST uses a problem description with conditional effects
No preconditions are given
The PROST description is more general

[Figure: grid world with player, goal field, and move path]

In PROST:
Action: move_up

In, e.g., computer chess:
Action: move_a2_to_a3


Predicate Rapid Action Value Estimation

A state has predicates that give some context
Idea: use predicates to find similar states and use their score

$Q_{PRAVE}(s, a) = \frac{1}{N} \sum_{p \in P} Q_{RAVE}(p, a)$  (6)

and weight with

$\alpha = \max\left\{0, \frac{V - v(n)}{V}\right\}$  (7)
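A sketch of Equation (6), assuming N = |P| (the number of predicates true in s) and a hypothetical lookup q_rave(p, a) for the per-predicate RAVE scores:

```python
def prave_q(predicates, action, q_rave):
    """Equation (6): average the RAVE scores of the state's predicates."""
    if not predicates:
        return 0.0
    return sum(q_rave(p, action) for p in predicates) / len(predicates)
```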


All Moves as First - Conclusion - Revisited

[Figure: IPPC scores (0 to 1) per domain and in total for UCT, PRAVE, RAVE(25), and AMAF(α = 0.2)]


Overview

Tree-Policy Enhancements
- All Moves as First
  - α-AMAF
  - Cutoff-AMAF
- Rapid Action Value Estimation

Default-Policy Enhancements
- Move-Average Sampling Technique

Conclusion


What is a Default Policy?

[Figure: the simulation phase of MCTS]

Simulate the outcome of a trial
Basic default policy: random walk
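A minimal random-walk default policy, assuming hypothetical helpers applicable_actions(s) and sample_successor(s, a):

```python
import random

def random_walk(state, applicable_actions, sample_successor, horizon):
    """Basic default policy: choose uniformly random actions until the
    horizon is reached and return the visited (state, action) pairs."""
    trial = []
    for _ in range(horizon):
        action = random.choice(applicable_actions(state))
        trial.append((state, action))
        state = sample_successor(state, action)
    return trial
```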


X-Average Sampling Technique

Use tree knowledge to bias the default policy towards moves that are more goal-oriented


Move-Average Sampling Technique - Idea - Sample Game

[Figure: grid world with player, goal field, and move path]

Introduce Q(a)

Use moves that are good on average
Choose an action according to:

$P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b \in A} e^{Q(b)/\tau}}$  (8)
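A sketch of sampling from the Gibbs distribution in Equation (8); the temperature τ is a tuning parameter whose value is assumed here:

```python
import math
import random

def mast_sample(actions, q, tau=1.0):
    """Equation (8): sample an action with probability proportional to
    exp(Q(a) / tau), favouring moves that are good on average (sketch)."""
    weights = [math.exp(q.get(a, 0.0) / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```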


Move-Average Sampling Technique - Idea - Example

Actions: r, r, u, u, u
Q(r) = 1; N(r) = 2
Q(u) = 6; N(u) = 3

Actions: r, r, u, l, l
Q(r) = 2; N(r) = 4
Q(u) = 7; N(u) = 4
Q(l) = 3; N(l) = 2
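To make Equation (8) concrete with the first set of values above, and an assumed temperature of τ = 1 (the slides do not fix τ): $P(u) = e^6 / (e^1 + e^6) \approx 0.993$ and $P(r) = e^1 / (e^1 + e^6) \approx 0.007$, so the biased default policy would almost always move up from this position.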


Move-Average Sampling Technique - Idea - Example (2)

Actions: l, u, u, r, r
Q(r) = 7; N(r) = 6
Q(u) = 8; N(u) = 6
Q(l) = 2; N(l) = 3

Actions: r, r, r, u, u
Q(r) = 7; N(r) = 9
Q(u) = 9; N(u) = 8
Q(l) = 2; N(l) = 3


Move-Average Sampling Technique - Results

[Figure: IPPC scores (0 to 0.5) per domain and in total for UCT(RandomWalk) vs. UCT(MAST)]


Overview

Tree-Policy Enhancements
- All Moves as First
  - α-AMAF
  - Cutoff-AMAF
- Rapid Action Value Estimation

Default-Policy Enhancements
- Move-Average Sampling Technique

Conclusion


Conclusion

Tree-policy enhancements
- α-AMAF and RAVE perform worse than standard UCT
- PRAVE performs slightly better, but still worse than standard UCT

Default-policy enhancements
- MAST outperforms RandomWalk


Questions?

m.neidinger@unibas.ch