Administrivia
•Reminder:
•Midterm exam, this Thurs (Oct 20)
•Spec v0.98 released today (after class)
•Check class web page
Temporal perspective
•Last Thursday:
•Fall break. Hope you had a good one!
•Last Tuesday:
•The true meaning of static
•Singleton design pattern
•Today:
✓Astronomy
✓Administrivia
•Static, singletons, & enums (briefly)
•More RL (algorithms!)
Enums
•Nothing magical about Java enums
•“Under the hood” they’re just classes + the singleton design pattern
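A minimal sketch of what that means (simplified; the real compiler output also extends java.lang.Enum and adds values()/valueOf()):
// A simple enum...
enum Direction { NORTH, SOUTH }

// ...behaves roughly like this hand-written class:
final class DirectionAsClass {
    // Each constant is a shared singleton instance
    public static final DirectionAsClass NORTH = new DirectionAsClass("NORTH");
    public static final DirectionAsClass SOUTH = new DirectionAsClass("SOUTH");

    private final String _name;

    // Private constructor: no outside instantiation -- the singleton pattern
    private DirectionAsClass(String name) { _name = name; }

    public String toString() { return _name; }
}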
Recall: The MDP
•Entire RL environment defined by a Markov decision process:
•M = ⟨S, A, T, R⟩
•S: state space
•A: action space
•T: transition function
•R: reward function
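One purely hypothetical way to picture those four pieces as Java types (a sketch only, not the course framework’s actual interfaces):
import java.util.Set;

// Hypothetical sketch, not the course's real API
interface MDP<S, A> {
    Set<S> states();                      // S: state space
    Set<A> actions();                     // A: action space
    double transition(S s, A a, S sNext); // T: probability of reaching sNext via a from s
    double reward(S s, A a);              // R: immediate reward for taking a in s
}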
Policies
•Plan of action is called a policy, π
•Policy defines what action to take in every state of the system:
•π(s42)=“FWD” means “When I find myself in state s42, execute the action FORWARD”
•This is the basis for learning: tweak π (indirectly) to make the agent better over time
The goal of RL
•Agent’s goal:
•Find the best possible policy: π*
•Find the policy, π*, that maximizes Vπ(s) for all s
•Q: What’s the simplest Java implementation of a policy?
Explicit policy
public class MyAgent implements RLAgent {
    private final Map<State2d, Action> _policy;

    public MyAgent() {
        _policy = new HashMap<State2d, Action>();
    }

    public Action pickAction(State2d here) {
        if (_policy.containsKey(here)) {
            return _policy.get(here);
        }
        // generate a default and add to _policy
        Action fallback = Action.FWD; // assumes Action.FWD exists; any fixed default works
        _policy.put(here, fallback);
        return fallback;
    }
}
Implicit policy
public class MyAgent2 implements RLAgent {
    private final Map<State2d, Map<Action, Double>> _policy;

    public MyAgent2() {
        _policy = new HashMap<State2d, Map<Action, Double>>(); // value type must be Map, not HashMap, to match the field
    }

    public Action pickAction(State2d here) {
        if (_policy.containsKey(here)) {
            Action maxAct = null;
            double v = Double.NEGATIVE_INFINITY; // not Double.MIN_VALUE: Q values can be negative
            for (Action a : _policy.get(here).keySet()) {
                if (_policy.get(here).get(a) > v) {
                    v = _policy.get(here).get(a); // remember the best Q seen so far
                    maxAct = a;
                }
            }
            return maxAct;
        }
        // handle default action case
        return Action.FWD; // assumes Action.FWD exists, as in MyAgent above
    }
}
Q functions
•Implicit policy uses the idea of a “Q function”
•Q : S × A → Reals
•For each action at each state, says how good/bad that action is
•If Q(si,a1)>Q(si,a2), then a1 is a “better” action than a2 at state si
•Represented in code with Map<Action,Double>:
•Mapping from an Action to the value (Q) of that Action
“Where should I go now?”
•Q values at state s29:
•Q(s29,FWD)=2.38
•Q(s29,BACK)=1.79
•Q(s29,TURNCLOCK)=3.49
•Q(s29,TURNCC)=0.74
•Q(s29,NOOP)=2.03
⇒ “Best thing to do is turn clockwise”
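Connecting this to the Map<Action,Double> representation, a small self-contained sketch of the s29 lookup (the Action enum here is local to the demo):
import java.util.HashMap;
import java.util.Map;

public class QLookupDemo {
    // Action names taken from the slide; a standalone enum for this demo
    enum Action { FWD, BACK, TURNCLOCK, TURNCC, NOOP }

    public static void main(String[] args) {
        // Q values for state s29, exactly as on the slide
        Map<Action, Double> qAtS29 = new HashMap<Action, Double>();
        qAtS29.put(Action.FWD, 2.38);
        qAtS29.put(Action.BACK, 1.79);
        qAtS29.put(Action.TURNCLOCK, 3.49);
        qAtS29.put(Action.TURNCC, 0.74);
        qAtS29.put(Action.NOOP, 2.03);

        // Pick the argmax, same loop as in MyAgent2.pickAction()
        Action best = null;
        double bestV = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Action, Double> e : qAtS29.entrySet()) {
            if (e.getValue() > bestV) {
                bestV = e.getValue();
                best = e.getKey();
            }
        }
        System.out.println("Best action at s29: " + best); // prints TURNCLOCK
    }
}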
Q, cont’d
•Now we have something that we can learn!
•For a given state, s, and action, a, adjust Q for that ⟨s,a⟩ pair
•If a seems better than _policy currently has recorded, increase Q(s,a)
•If a seems worse than _policy currently has recorded, decrease Q(s,a)
Q learning in math...
•Let ⟨s,a,r,s′⟩ be an experience tuple
•Let a′ = argmax_g Q(s′,g)
•“Best” action at next state, s′
•Q learning rule says: update current Q with a fraction of the next-state Q value:
•Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a))
•0 ≤ α < 1 and 0 ≤ γ < 1 are constants that change the behavior of the algorithm
Q learning in code...
public class MyAgent implements RLAgent {
    public void updateModel(SARSTuple s) {
        State2d start = s.getInitState();
        State2d end = s.getNextState();
        Action act = s.getAction();
        double r = s.getReward();

        double Qnow = _policy.get(start).get(act);
        double Qnext = findMaxQ(_policy.get(end)); // max_g Q(s′,g)
        double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);

        _policy.get(start).put(act, Qrevised);
    }

    // Helper: plain Maps have no findMaxQ() method, so compute the max Q at a state
    private double findMaxQ(Map<Action, Double> qs) {
        double best = Double.NEGATIVE_INFINITY;
        for (double q : qs.values()) { best = Math.max(best, q); }
        return best;
    }
}
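To see where updateModel() fits, here is a hypothetical agent-environment loop; the Environment interface and the SARSTuple constructor below are assumptions for illustration, not the course framework’s real API:
// Hypothetical sketch: Environment and the SARSTuple constructor are
// assumptions for illustration, not the course framework's real API
interface Environment {
    State2d getCurrentState();
    double execute(Action a); // perform the action, return the reward
}

public class TrainingLoopSketch {
    public static void run(MyAgent agent, Environment env, int steps) {
        State2d s = env.getCurrentState();
        for (int i = 0; i < steps; i++) {
            Action a = agent.pickAction(s);        // choose action from current policy
            double r = env.execute(a);             // act, observe reward
            State2d sNext = env.getCurrentState(); // observe next state
            agent.updateModel(new SARSTuple(s, a, r, sNext)); // Q-learning update
            s = sNext;                             // continue from the new state
        }
    }
}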