Page 1: Policy Evaluation & Policy Iteration

Policy Evaluation & Policy Iteration

S&B: Sec 4.1, 4.3; 6.5

Page 2: Policy Evaluation & Policy Iteration

The Bellman equation

•The final recursive equation is known as the Bellman equation:

  V^π(s) = R(s) + γ · sum_{s’}( T(s,π(s),s’) * V^π(s’) )

•The unique solution to this equation gives the value of a fixed policy π when operating in a known MDP M=〈S,A,T,R〉

•When the state/action spaces are discrete, can think of V and R as vectors and T^π as a matrix, and get the matrix equation:

  V^π = R + γ T^π V^π
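.
As a concrete illustration, the matrix form can be solved directly with a linear solver. Below is a minimal NumPy sketch; the two-state T^π and R are made-up toy numbers, only the solve itself comes from the slide.

    import numpy as np

    gamma = 0.9
    # Toy 2-state example (hypothetical numbers, just to make the solve concrete):
    # T_pi[s, s'] = probability of moving from s to s' under the fixed policy pi
    T_pi = np.array([[0.8, 0.2],
                     [0.1, 0.9]])
    R = np.array([0.0, 1.0])   # state-based rewards R(s)

    # V = (I - gamma * T_pi)^{-1} R, solved without forming the inverse explicitly
    V = np.linalg.solve(np.eye(2) - gamma * T_pi, R)
    print(V)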

Page 3: Policy Evaluation & Policy Iteration

Exercise

•Solve the matrix Bellman equation (i.e., find V):

  V^π = R + γ T^π V^π

•I formulated the Bellman equations for “state-based” rewards: R(s)

•Formulate & solve the B.E. for “state-action” rewards (R(s,a)) and “state-action-state” rewards (R(s,a,s’))
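.
For reference, one common way these reward variants are written (a sketch of the standard forms; conventions differ across texts, so treat this as one possible answer rather than the canonical one):

    % state-action rewards R(s,a):
    V^{\pi}(s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s,\pi(s),s')\, V^{\pi}(s')

    % state-action-state rewards R(s,a,s'):
    V^{\pi}(s) = \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma\, V^{\pi}(s') \right]

In matrix form, both reduce to the same linear system as before once the expected one-step reward under π is collected into a vector.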

Page 4: Policy Evaluation & Policy Iteration

Policy values in practice

“Robot” navigation in a grid maze

Page 5: Policy Evaluation & Policy Iteration

Policy values in practice

Optimal policy, π*

Page 6: Policy Evaluation & Policy Iteration

Policy values in practice

Value function for optimal policy, V*

Page 7: Policy Evaluation & Policy Iteration

A harder “maze”...

Page 8: Policy Evaluation & Policy Iteration

A harder “maze”...

Optimal policy, π*

Page 9: Policy Evaluation & Policy Iteration

A harder “maze”...

Value function for optimal policy, V*

Page 10: Policy Evaluation & Policy Iteration

A harder “maze”...

Value function for optimal policy, V*

Page 11: Policy Evaluation & Policy Iteration

Still more complex...

Page 12: Policy Evaluation & Policy Iteration

Still more complex...

Optimal policy, π*

Page 13: Policy Evaluation & Policy Iteration

Still more complex...

Value function for optimal policy, V*

Page 14: Policy Evaluation & Policy Iteration

Still more complex...

Value function for optimal policy, V*

Page 15: Policy Evaluation & Policy Iteration

Planning: finding π*

•So we know how to evaluate a single policy, π

•How do you find the best policy?

•Remember: still assuming that we know M=〈S,A,T,R〉

Page 16: Policy Evaluation & Policy Iteration

Planning: finding π*

•So we know how to evaluate a single policy, π

•How do you find the best policy?

•Remember: still assuming that we know M=〈S,A,T,R〉

•Non-solution: iterate through all possible π, evaluating each one; keep the best (there are |A|^|S| deterministic policies, so this is exponential in |S|)

Page 17: Policy Evaluation & Policy Iteration

Policy iteration & friends

•Many different solutions available.

•All exploit some characteristics of MDPs:

•For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (there may be more than one such policy, all equal in value)

•The Bellman equation expresses the recursive structure of an optimal policy

•Leads to a series of closely related policy solutions: policy iteration, value iteration, generalized policy iteration, etc.

Page 18: Policy Evaluation & Policy Iteration

The policy iteration alg.

Function: policy_iteration
Input: MDP M=〈S,A,T,R〉, discount γ
Output: optimal policy π*; opt. value func. V*
Initialization: choose π_0 arbitrarily
Repeat {
    V_i = eval_policy(M, π_i, γ)              // from Bellman eqn
    π_{i+1} = local_update_policy(π_i, V_i)
} Until (π_{i+1} == π_i)

Function: π’ = local_update_policy(π, V)
for i = 1..|S| {
    π’(s_i) = argmax_{a∈A} { sum_j( T(s_i,a,s_j) * V(s_j) ) }
}
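.
A minimal executable sketch of the same loop in Python/NumPy. The array conventions are assumptions, not from the slides: T is an |S|×|A|×|S| transition tensor, R a state-based reward vector, and a policy is an array of action indices.

    import numpy as np

    def eval_policy(T, R, policy, gamma):
        """Exact policy evaluation: solve V = R + gamma * T_pi V."""
        n = len(R)
        T_pi = T[np.arange(n), policy, :]      # |S| x |S| matrix induced by this policy
        return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

    def local_update_policy(T, V):
        """Greedy one-step update: pi'(s) = argmax_a sum_s' T(s,a,s') * V(s')."""
        return np.argmax(T @ V, axis=1)        # T @ V has shape |S| x |A|

    def policy_iteration(T, R, gamma=0.9):
        n_states, n_actions, _ = T.shape
        policy = np.zeros(n_states, dtype=int)     # arbitrary initial policy pi_0
        while True:
            V = eval_policy(T, R, policy, gamma)   # from Bellman eqn
            new_policy = local_update_policy(T, V)
            if np.array_equal(new_policy, policy): # stop when pi_{i+1} == pi_i
                return policy, V
            policy = new_policy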

Page 19: Policy Evaluation & Policy Iteration

Why does this work?

•2 explanations:

•Theoretical:

•The local update w.r.t. the policy value is a contractive mapping, ergo a fixed point exists and will be reached

•See “contraction mapping”, “Banach fixed-point theorem”, etc.
  • http://math.arizona.edu/~restrepo/475A/Notes/sourcea/node22.html
  • http://planetmath.org/encyclopedia/BanachFixedPointTheorem.html

•Contracts w.r.t. the Bellman Error:
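.
The contraction property is usually stated in the max norm (a sketch of the standard form; γ < 1 is assumed):

    % Bellman backup operator for a fixed policy \pi, state-based rewards:
    (B^{\pi} V)(s) = R(s) + \gamma \sum_{s'} T(s,\pi(s),s')\, V(s')

    % Contraction: for any two value functions V_1, V_2,
    \| B^{\pi} V_1 - B^{\pi} V_2 \|_{\infty} \le \gamma\, \| V_1 - V_2 \|_{\infty}

Because γ < 1, each application of the backup shrinks the Bellman error by at least a factor of γ, so the fixed point V^π exists, is unique, and is reached in the limit.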

Page 20: Policy Evaluation & Policy Iteration

Why does this work?

•The intuitive explanation

•It’s doing a dynamic-programming “backup” of reward from reward “sources”

•At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step

•Value “propagates away” from sources and the policy is able to say “hey! there’s reward over there! I can get some of that action if I change a bit!”

Page 21: Policy Evaluation & Policy Iteration

P.I. in action

Policy and value: Iteration 0

Page 22: Policy Evaluation & Policy Iteration

P.I. in action

Policy and value: Iteration 1

Page 23: Policy Evaluation & Policy Iteration

P.I. in action

Policy and value: Iteration 2

Page 24: Policy Evaluation & Policy Iteration

P.I. in action

Policy and value: Iteration 3

Page 25: Policy Evaluation & Policy Iteration

P.I. in action

Policy and value: Iteration 4

Page 26: Policy Evaluation & Policy Iteration

P.I. in action

Policy and value: Iteration 5

Page 27: Policy Evaluation & Policy Iteration

P.I. in action

Policy and value: Iteration 6 (done)

Page 28: Policy Evaluation & Policy Iteration

Properties & Variants

•Policy iteration

•Known to converge (provable)

•Observed to converge exponentially quickly

•# iterations is O(ln(|S|))

•Empirical observation; strongly believed but no proof (yet)

•O(|S|³) time per iteration (policy evaluation)

•Other methods possible

•Linear program (poly time soln exists)

•Value iteration (sketched below)

•Generalized policy iter. (often best in practice)
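.
A minimal value-iteration sketch, for comparison with the policy-iteration code above (same assumed array conventions: T is |S|×|A|×|S|, R is a state-based reward vector):

    import numpy as np

    def value_iteration(T, R, gamma=0.9, tol=1e-8):
        """Repeatedly apply the Bellman optimality backup
        V(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') * V(s')."""
        n_states, n_actions, _ = T.shape
        V = np.zeros(n_states)
        while True:
            V_new = R + gamma * np.max(T @ V, axis=1)
            if np.max(np.abs(V_new - V)) < tol:               # stop when the backup barely changes V
                return np.argmax(T @ V_new, axis=1), V_new    # greedy policy, value
            V = V_new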

Page 29: Policy Evaluation & Policy Iteration

Q: A key operative

•Critical step in policy iteration:

  π’(s_i) = argmax_{a∈A} { sum_j( T(s_i,a,s_j) * V(s_j) ) }

•Asks “What happens if I ignore π for just one step, and do a instead (and then resume doing π thereafter)?”

•Often-used operation. Gets a special name:

•Definition: the Q function is:

  Q^π(s,a) = R(s) + γ · sum_{s’}( T(s,a,s’) * V^π(s’) )

•Policy iter says: “Figure out Q, act greedily according to Q, then update Q and repeat, until you can’t do any better...”
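.
In the NumPy conventions used earlier, the Q table can be computed from a value function in one line (a sketch; state-based rewards R(s) are still an assumption):

    import numpy as np

    def q_from_v(T, R, V, gamma):
        """Q(s,a) = R(s) + gamma * sum_s' T(s,a,s') * V(s')."""
        return R[:, None] + gamma * (T @ V)    # |S| x |A| table, one entry per state/action pair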

Page 30: Policy Evaluation & Policy Iteration

What to do with Q

•Can think of Q as a big table: one entry for each state/action pair

•“If I’m in state s and take action a, this is my expected discounted reward...”

•A “one-step” exploration: “In state s, if I deviate from my policy π for one timestep, then keep doing π, is my life better or worse?”

•Can get V and π from Q:

  V(s) = max_{a∈A} Q(s,a)    π(s) = argmax_{a∈A} Q(s,a)
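.
With the Q table from the previous sketch, that extraction is just a max and an argmax over the action axis:

    import numpy as np

    def v_and_policy_from_q(Q):
        """Recover V and a greedy policy from the |S| x |A| Q table."""
        V = np.max(Q, axis=1)          # V(s)  = max_a Q(s,a)
        policy = np.argmax(Q, axis=1)  # pi(s) = argmax_a Q(s,a)
        return V, policy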

