Optimistic Initialization and Greediness Lead to Polynomial-Time Learning in Factored MDPs


István Szita (University of Alberta, Canada) & András Lőrincz (Eötvös Loránd University, Hungary)


Outline
- Factored MDPs: motivation, definitions, planning in FMDPs
- Optimism
- Optimism & FMDPs & model-based learning


Reinforcement learning
- the agent makes decisions … in an unknown world
- it makes some observations (including rewards)
- it tries to maximize the collected reward


What kind of observation?
- the observations are structured
- but the structure is unclear


How to “solve an RL task”?
- a model is useful:
  - we can reuse experience from previous trials
  - we can learn offline
- observations are structured, but the structure is unknown
- structured + model + RL = FMDP! (or linear dynamical systems, neural networks, etc.)


Factored MDPs
- ordinary MDPs, but everything is factored:
  - states
  - rewards
  - transition probabilities
  - (value functions)


Factored state space

all functions depend on a few variables only
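The notation on the slide is not in the transcript; the standard setup, which the talk presumably follows, is that the state is a vector of variables and every quantity of interest has small scope:

\[
x = (x_1, \dots, x_m), \quad x_i \in X_i, \qquad
f(x) = f\bigl(x[\Gamma_f]\bigr) \ \text{with}\ |\Gamma_f| \ll m ,
\]

where \(x[\Gamma] = (x_i)_{i \in \Gamma}\) denotes the restriction of \(x\) to the variables in \(\Gamma\).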


Factored dynamics
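The formula on this slide is missing from the transcript; the standard dynamic-Bayesian-network factorization of the transition probabilities is

\[
P(y \mid x, a) \;=\; \prod_{i=1}^{m} P_i\bigl(y_i \mid x[\Gamma_i], a\bigr),
\]

where \(\Gamma_i\) is the small set of parent variables of component \(i\) (as in the tables later in the talk, e.g. component \(x_1\) with parents \((x_1, x_3)\)).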


Factored rewards
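The reward formula is likewise missing; the standard form is a sum of local reward factors,

\[
R(x, a) \;=\; \sum_{j} R_j\bigl(x[Z_j], a\bigr),
\]

with each scope \(Z_j\) small.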


(Factored value functions)

V* is not factored in general, so we will make an approximation error.
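The approximation used (standard for factored MDPs; the exact basis is not specified in the transcript) is a linear combination of basis functions with small scopes:

\[
V(x) \;\approx\; \sum_{k} w_k\, h_k\bigl(x[C_k]\bigr) \;=\; (Hw)(x),
\]

where \(H\) is the matrix whose columns are the basis functions \(h_k\) (the same \(H\) as on the factored value iteration slide below).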


Solving a known FMDP
- NP-hard: either exponential time or non-optimal…
- exponential time in the worst case:
  - flattening the FMDP
  - approximate policy iteration [Koller & Parr, 2000; Boutilier, Dearden & Goldszmidt, 2000]
- non-optimal solutions (approximating the value function in factored form):
  - approximate linear programming [Guestrin, Koller, Parr & Venkataraman, 2002]
  - ALP + policy iteration [Guestrin et al., 2002]
  - factored value iteration [Szita & Lőrincz, 2008]


Factored value iteration
- H := matrix of basis functions
- N(Hᵀ) := row-normalization of Hᵀ
- the iteration converges to a fixed point w♯ and can be computed quickly for FMDPs
- let V♯ = H w♯; then V♯ has bounded error
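The update formula did not survive the transcript; a plausible form, assuming the operator from the authors' factored value iteration work, is

\[
w_{t+1} \;=\; \mathcal{N}\bigl(H^{\top}\bigr)\, T\bigl(H w_t\bigr),
\qquad
(TV)(x) \;=\; \max_a \Bigl[ R(x,a) + \gamma \sum_{y} P(y \mid x, a)\, V(y) \Bigr],
\]

i.e. one Bellman backup of the current value estimate \(H w_t\), followed by the normalized linear map \(\mathcal{N}(H^{\top})\) back to weight space.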


Learning in unknown FMDPs
- unknown factor decompositions (structure)
- unknown rewards
- unknown transitions (dynamics)



Outline
- Factored MDPs: motivation, definitions, planning in FMDPs
- Optimism
- Optimism & FMDPs & model-based learning


Learning in an unknown FMDP, a.k.a. “Explore or exploit?”
- after trying a few action sequences…
- … try to discover better ones?
- … or do the best thing according to current knowledge?


Be Optimistic!

(when facing uncertainty)


either you get experience…


or you get reward!


Outline
- Factored MDPs: motivation, definitions, planning in FMDPs
- Optimism
- Optimism & FMDPs & model-based learning


Factored Initial Model

(transition-count tables: each row is a (parent configuration, action) pair, each column a next value of the component; “-” means no data yet)

component x1, parents: (x1, x3)

                0    1
  (0,0), a1     -    -
  (0,0), a2     -    -
  (0,1), a1     -    -
  (0,1), a2     -    -
  (1,0), a1     -    -
  (1,0), a2     -    -
  (1,1), a1     -    -
  (1,1), a2     -    -

component x2, parent: (x2)

                0    1
  (0), a1       -    -
  (0), a2       -    -
  (1), a1       -    -
  (1), a2       -    -


Factored Optimistic Initial Model

component x1, parents: (x1, x3)

                0    1    GOE
  (0,0), a1     -    -     1
  (0,0), a2     -    -     1
  (0,1), a1     -    -     1
  (0,1), a2     -    -     1
  (1,0), a1     -    -     1
  (1,0), a2     -    -     1
  (1,1), a1     -    -     1
  (1,1), a2     -    -     1

component x2, parent: (x2)

                0    1    GOE
  (0), a1       -    -     1
  (0), a2       -    -     1
  (1), a1       -    -     1
  (1), a2       -    -     1

GOE = “Garden of Eden” state: +$10,000 reward (or something very high)
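A minimal sketch of this initialization, assuming a count-table representation like the tables above (the function and variable names, and the representation itself, are illustrative rather than taken from the paper):

```python
# Hypothetical sketch (not the authors' code) of the optimistic initial model:
# for every component, every (parent configuration, action) pair starts with a
# single imagined transition to the "Garden of Eden" (GOE) value, which is the
# only place where the huge optimistic reward can be collected.
from itertools import product

GOE = "GOE"          # fictitious extra value appended to each state variable
GOE_REWARD = 10_000  # "or something very high"

def optimistic_initial_counts(parent_domains, actions):
    """Transition-count table for one component: counts[(parents, a)][next_value]."""
    counts = {}
    for parents in product(*parent_domains):
        for a in actions:
            counts[(parents, a)] = {GOE: 1}  # one pseudo-visit leading to the GOE
    return counts

# component x1 with parents (x1, x3), both binary, and actions a1, a2
x1_counts = optimistic_initial_counts([(0, 1), (0, 1)], ["a1", "a2"])
# component x2 with parent (x2)
x2_counts = optimistic_initial_counts([(0, 1)], ["a1", "a2"])

print(x1_counts[((0, 0), "a1")])  # -> {'GOE': 1}
```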


Later on…

component x1, parents: (x1, x3)

                0    1    GOE
  (0,0), a1    25   30     1
  (0,0), a2    42   12     1
  (0,1), a1     3    1     1
  (0,1), a2     2    5     1
  (1,0), a1    11    9     1
  (1,0), a2     2   29     1
  (1,1), a1    56   63     1
  (1,1), a2    98    -     1

component x2, parent: (x2)

                0    1    GOE
  (0), a1      42   34     1
  (0), a2      25   27     1
  (1), a1       7    1     1
  (1), a2       3    6     1

- according to the initial model, all states have very high value
- in frequently visited states the model becomes more realistic → reward expectations get lower → the agent explores other areas
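Continuing the hypothetical count-table sketch from above: once real transitions are recorded, the single GOE pseudo-count gets washed out, so the estimated probability of reaching the Garden of Eden (and with it the optimism) shrinks roughly as one over the number of visits:

```python
def estimated_prob(row, next_value):
    """Empirical probability of one next value, GOE pseudo-count included."""
    return row.get(next_value, 0) / sum(row.values())

# counts for component x1, parent configuration (0,0), action a1 (from the table above),
# plus the one optimistic pseudo-count pointing to the Garden of Eden
row = {0: 25, 1: 30, "GOE": 1}
print(estimated_prob(row, "GOE"))  # 1/56, roughly 0.018: the optimism has faded
print(estimated_prob(row, 1))      # 30/56, a realistic estimate
```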


Factored optimistic initial model
- initialize the model (optimistically)
- for each time step t:
  - solve the approximate model using factored value iteration
  - take the greedy action, observe the next state
  - update the model
- the number of non-near-optimal steps (w.r.t. V♯, the fixed point of factored value iteration) is polynomial, with probability close to 1
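A high-level sketch of this loop, with the environment, planner, and model passed in as placeholders (none of these interfaces come from the paper; they just make the control flow concrete):

```python
def optimistic_greedy_loop(env, init_optimistic_model, plan_with_fvi, n_steps):
    """Optimistic initialization + greedy acting: no explicit exploration logic."""
    model = init_optimistic_model()                # GOE pseudo-counts, as sketched above
    state = env.reset()
    for _ in range(n_steps):
        policy = plan_with_fvi(model)              # approximate planning: factored value iteration
        action = policy(state)                     # greedy action w.r.t. the optimistic plan
        next_state, reward = env.step(action)
        model.update(state, action, reward, next_state)  # just increment the relevant counts
        state = next_state
    return model
```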


Elements of proof: some standard stuff
- if the estimated model is close to the true one, then so are the corresponding value functions
- if the per-component estimates are accurate for all i, then the composite transition model is accurate
- let m_i be the number of visits to a given parent configuration; if m_i is large, then the estimated probabilities are close to the true ones for all values y_i
- more precisely: the error is small with high probability (Hoeffding/Azuma inequality)
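The bound itself is not in the transcript; the generic Hoeffding form for an empirical frequency, which is presumably what the slide instantiates, is

\[
\Bigl| \hat{P}_i\bigl(y_i \mid x[\Gamma_i], a\bigr) - P_i\bigl(y_i \mid x[\Gamma_i], a\bigr) \Bigr|
\;\le\; \sqrt{\frac{\ln(2/\delta)}{2\, m_i}}
\qquad \text{with probability at least } 1 - \delta .
\]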


Elements of proof: main lemma
- the approximate Bellman updates stay more optimistic than the real ones
- if V_E (the Garden of Eden value) is large enough, the bonus term dominates for a long time
- if all elements of H are nonnegative, the projection preserves optimism
- in the inequality on the slide, the lower bound comes from Azuma's inequality and the bonus is the one promised by the Garden of Eden state
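The “projection preserves optimism” step rests on a simple monotonicity fact, stated here in generic form (a standard observation, not the slide's exact wording):

\[
G \ge 0 \ \text{entrywise and}\ V_1 \ge V_2 \ \text{componentwise}
\;\Longrightarrow\; G V_1 \ge G V_2 ,
\]

applied with \(G = H\,\mathcal{N}(H^{\top})\), which is entrywise nonnegative whenever all elements of \(H\) are.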


Elements of proof: wrap-up
- for a long time, V_t is optimistic enough to boost exploration
- at most polynomially many exploration steps can be made
- in all other steps, the agent must be near-V♯-optimal


Previous approaches
- extensions of E3, Rmax, and MBIE to FMDPs
- using the current model, make a smart plan (explore or exploit)
  - explore: make the model more accurate
  - exploit: collect near-optimal reward
- the planner is left unspecified; requirement: its output plan is close to optimal … e.g., solve the flat MDP
- polynomial sample complexity, but exponential amounts of computation!


Unknown rewards?
- “To simplify the presentation, we assume the reward function is known and does not need to be learned. All results can be extended to the case of an unknown reward function.” False.
- problem: we cannot observe the reward components, only their sum!
- see the UAI poster [Walsh, Szita, Diuk & Littman, 2009]


Unknown structure?
- it can be learned in polynomial time
- SLF-Rmax [Strehl, Diuk & Littman, 2007]
- Met-Rmax [Diuk, Li & Littman, 2009]


Take-home message

if your model starts out optimistically enough,

you get efficient exploration for free!

(even if your planner is non-optimal, as long as it is monotonic)

Thank you for your attention!


Optimistic initial model for FMDPs
- add a “Garden of Eden” value to each state variable
- add reward factors for each state variable
- initialize the transition model

