Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs


István Szita (University of Alberta, Canada) & András Lőrincz (Eötvös Loránd University, Hungary)


Outline
- Factored MDPs: motivation, definitions, planning in FMDPs
- Optimism
- Optimism & FMDPs & model-based learning


Reinforcement learning
- the agent makes decisions … in an unknown world
- it makes some observations (including rewards)
- it tries to maximize the collected reward


What kind of observation?
- the observations are structured
- but the structure is unclear


How to “solve an RL task”?
- a model is useful:
  - we can reuse experience from previous trials
  - we can learn offline
- observations are structured, but the structure is unknown
- structured + model + RL = FMDP! (or linear dynamical systems, neural networks, etc.)


Factored MDPs
- ordinary MDPs, but everything is factored:
  - states
  - rewards
  - transition probabilities
  - (value functions)


Factored state space

all functions depend on a few variables only
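The notation on the slide is not in the transcript; the standard setup, which the talk presumably follows, is that the state is a vector of variables and every quantity of interest has small scope:

\[
x = (x_1, \dots, x_m), \quad x_i \in X_i, \qquad
f(x) = f\bigl(x[\Gamma_f]\bigr) \ \text{with}\ |\Gamma_f| \ll m ,
\]

where \(x[\Gamma] = (x_i)_{i \in \Gamma}\) denotes the restriction of \(x\) to the variables in \(\Gamma\).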


Factored dynamics
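The formula on this slide is missing from the transcript; the standard dynamic-Bayesian-network factorization of the transition probabilities is

\[
P(y \mid x, a) \;=\; \prod_{i=1}^{m} P_i\bigl(y_i \mid x[\Gamma_i], a\bigr),
\]

where \(\Gamma_i\) is the small set of parent variables of component \(i\) (as in the tables later in the talk, e.g. component \(x_1\) with parents \((x_1, x_3)\)).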


Factored rewards
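The reward formula is likewise missing; the standard form is a sum of local reward factors,

\[
R(x, a) \;=\; \sum_{j} R_j\bigl(x[Z_j], a\bigr),
\]

with each scope \(Z_j\) small.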


(Factored value functions)

V* is not factored in general, so we will make an approximation error.
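The approximation used (standard for factored MDPs; the exact basis is not specified in the transcript) is a linear combination of basis functions with small scopes:

\[
V(x) \;\approx\; \sum_{k} w_k\, h_k\bigl(x[C_k]\bigr) \;=\; (Hw)(x),
\]

where \(H\) is the matrix whose columns are the basis functions \(h_k\) (the same \(H\) as on the factored value iteration slide below).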


Solving a known FMDP
- NP-hard: either exponential time or non-optimal…
- exponential time in the worst case:
  - flattening the FMDP
  - approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
- non-optimal solutions (approximating the value function in factored form):
  - approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
  - ALP + policy iteration [Guestrin et al., 2002]
  - factored value iteration [Szita & Lőrincz, 2008]


Factored value iteration
- H := matrix of basis functions
- N(Hᵀ) := row-normalization of Hᵀ
- the iteration converges to a fixed point w♯ and can be computed quickly for FMDPs
- let V♯ = H w♯; then V♯ has bounded error
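The update formula did not survive the transcript; a plausible form, assuming the operator from the authors' factored value iteration work, is

\[
w_{t+1} \;=\; \mathcal{N}\bigl(H^{\top}\bigr)\, T\bigl(H w_t\bigr),
\qquad
(TV)(x) \;=\; \max_a \Bigl[ R(x,a) + \gamma \sum_{y} P(y \mid x, a)\, V(y) \Bigr],
\]

i.e. one Bellman backup of the current value estimate \(H w_t\), followed by the normalized linear map \(\mathcal{N}(H^{\top})\) back to weight space.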


Learning in unknown FMDPs
- unknown factor decompositions (structure)
- unknown rewards
- unknown transitions (dynamics)



Outline
- Factored MDPs: motivation, definitions, planning in FMDPs
- Optimism
- Optimism & FMDPs & model-based learning


Learning in an unknown FMDP, a.k.a. “Explore or exploit?”
- after trying a few action sequences…
- … try to discover better ones?
- … or do the best thing according to current knowledge?


Be Optimistic!

(when facing uncertainty)


either you get experience…


or you get reward!


Outline
- Factored MDPs: motivation, definitions, planning in FMDPs
- Optimism
- Optimism & FMDPs & model-based learning


Factored Initial Model

(transition-count tables: each row is a (parent configuration, action) pair, each column a next value of the component; “-” means no data yet)

component x1, parents: (x1, x3)

                0    1
  (0,0), a1     -    -
  (0,0), a2     -    -
  (0,1), a1     -    -
  (0,1), a2     -    -
  (1,0), a1     -    -
  (1,0), a2     -    -
  (1,1), a1     -    -
  (1,1), a2     -    -

component x2, parent: (x2)

                0    1
  (0), a1       -    -
  (0), a2       -    -
  (1), a1       -    -
  (1), a2       -    -


Factored Optimistic Initial Model

component x1, parents: (x1, x3)

                0    1    GOE
  (0,0), a1     -    -     1
  (0,0), a2     -    -     1
  (0,1), a1     -    -     1
  (0,1), a2     -    -     1
  (1,0), a1     -    -     1
  (1,0), a2     -    -     1
  (1,1), a1     -    -     1
  (1,1), a2     -    -     1

component x2, parent: (x2)

                0    1    GOE
  (0), a1       -    -     1
  (0), a2       -    -     1
  (1), a1       -    -     1
  (1), a2       -    -     1

GOE = “Garden of Eden” state: +$10,000 reward (or something very high)
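A minimal sketch of this initialization, assuming a count-table representation like the tables above (the function and variable names, and the representation itself, are illustrative rather than taken from the paper):

```python
# Hypothetical sketch (not the authors' code) of the optimistic initial model:
# for every component, every (parent configuration, action) pair starts with a
# single imagined transition to the "Garden of Eden" (GOE) value, which is the
# only place where the huge optimistic reward can be collected.
from itertools import product

GOE = "GOE"          # fictitious extra value appended to each state variable
GOE_REWARD = 10_000  # "or something very high"

def optimistic_initial_counts(parent_domains, actions):
    """Transition-count table for one component: counts[(parents, a)][next_value]."""
    counts = {}
    for parents in product(*parent_domains):
        for a in actions:
            counts[(parents, a)] = {GOE: 1}  # one pseudo-visit leading to the GOE
    return counts

# component x1 with parents (x1, x3), both binary, and actions a1, a2
x1_counts = optimistic_initial_counts([(0, 1), (0, 1)], ["a1", "a2"])
# component x2 with parent (x2)
x2_counts = optimistic_initial_counts([(0, 1)], ["a1", "a2"])

print(x1_counts[((0, 0), "a1")])  # -> {'GOE': 1}
```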


Later on…

component x1, parents: (x1, x3)

                0    1    GOE
  (0,0), a1    25   30     1
  (0,0), a2    42   12     1
  (0,1), a1     3    1     1
  (0,1), a2     2    5     1
  (1,0), a1    11    9     1
  (1,0), a2     2   29     1
  (1,1), a1    56   63     1
  (1,1), a2    98    -     1

component x2, parent: (x2)

                0    1    GOE
  (0), a1      42   34     1
  (0), a2      25   27     1
  (1), a1       7    1     1
  (1), a2       3    6     1

- according to the initial model, all states have very high value
- in frequently visited states the model becomes more realistic → reward expectations get lower → the agent explores other areas
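Continuing the hypothetical count-table sketch from above: once real transitions are recorded, the single GOE pseudo-count gets washed out, so the estimated probability of reaching the Garden of Eden (and with it the optimism) shrinks roughly as one over the number of visits:

```python
def estimated_prob(row, next_value):
    """Empirical probability of one next value, GOE pseudo-count included."""
    return row.get(next_value, 0) / sum(row.values())

# counts for component x1, parent configuration (0,0), action a1 (from the table above),
# plus the one optimistic pseudo-count pointing to the Garden of Eden
row = {0: 25, 1: 30, "GOE": 1}
print(estimated_prob(row, "GOE"))  # 1/56, roughly 0.018: the optimism has faded
print(estimated_prob(row, 1))      # 30/56, a realistic estimate
```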


Factored optimistic initial model
- initialize the model (optimistically)
- for each time step t:
  - solve the approximate model using factored value iteration
  - take the greedy action, observe the next state
  - update the model
- the number of non-near-optimal steps (w.r.t. V♯, the fixed point of factored value iteration) is polynomial, with probability close to 1
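A high-level sketch of this loop, with the environment, planner, and model passed in as placeholders (none of these interfaces come from the paper; they just make the control flow concrete):

```python
def optimistic_greedy_loop(env, init_optimistic_model, plan_with_fvi, n_steps):
    """Optimistic initialization + greedy acting: no explicit exploration logic."""
    model = init_optimistic_model()                # GOE pseudo-counts, as sketched above
    state = env.reset()
    for _ in range(n_steps):
        policy = plan_with_fvi(model)              # approximate planning: factored value iteration
        action = policy(state)                     # greedy action w.r.t. the optimistic plan
        next_state, reward = env.step(action)
        model.update(state, action, reward, next_state)  # just increment the relevant counts
        state = next_state
    return model
```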


Elements of proof: some standard stuff
- if the estimated model is close to the true one, then so are the corresponding value functions
- if the per-component estimates are accurate for all i, then the composite transition model is accurate
- let m_i be the number of visits to a given parent configuration; if m_i is large, then the estimated probabilities are close to the true ones for all values y_i
- more precisely: the error is small with high probability (Hoeffding/Azuma inequality)
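The bound itself is not in the transcript; the generic Hoeffding form for an empirical frequency, which is presumably what the slide instantiates, is

\[
\Bigl| \hat{P}_i\bigl(y_i \mid x[\Gamma_i], a\bigr) - P_i\bigl(y_i \mid x[\Gamma_i], a\bigr) \Bigr|
\;\le\; \sqrt{\frac{\ln(2/\delta)}{2\, m_i}}
\qquad \text{with probability at least } 1 - \delta .
\]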


Elements of proof: main lemma
- the approximate Bellman updates stay more optimistic than the real ones
- if V_E (the Garden of Eden value) is large enough, the bonus term dominates for a long time
- if all elements of H are nonnegative, the projection preserves optimism
- in the inequality on the slide, the lower bound comes from Azuma's inequality and the bonus is the one promised by the Garden of Eden state
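The “projection preserves optimism” step rests on a simple monotonicity fact, stated here in generic form (a standard observation, not the slide's exact wording):

\[
G \ge 0 \ \text{entrywise and}\ V_1 \ge V_2 \ \text{componentwise}
\;\Longrightarrow\; G V_1 \ge G V_2 ,
\]

applied with \(G = H\,\mathcal{N}(H^{\top})\), which is entrywise nonnegative whenever all elements of \(H\) are.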


Elements of proof: wrap-up
- for a long time, V_t is optimistic enough to boost exploration
- at most polynomially many exploration steps can be made
- in all other steps, the agent must be near-V♯-optimal


Previous approaches
- extensions of E3, Rmax, and MBIE to FMDPs
- using the current model, make a smart plan (explore or exploit)
  - explore: make the model more accurate
  - exploit: collect near-optimal reward
- the planner is left unspecified; requirement: its output plan is close to optimal … e.g., solve the flat MDP
- polynomial sample complexity, but exponential amounts of computation!


Unknown rewards?
- “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” False.
- problem: we cannot observe the reward components, only their sum!
- see the UAI poster [Walsh, Szita, Diuk & Littman, 2009]


Unknown structure?
- it can be learned in polynomial time
- SLF-Rmax [Strehl, Diuk & Littman, 2007]
- Met-Rmax [Diuk, Li & Littman, 2009]


Take-home message

if your model starts out optimistically enough,

you get efficient exploration for free!

(even if your planner is non-optimal, as long as it is monotonic)

Thank you for your attention!


Optimistic initial model for FMDPs
- add a “Garden of Eden” value to each state variable
- add reward factors for each state variable
- initialize the transition model

