Administrivia
•Reminder:
•Midterm exam, this Thurs (Oct 20)
•Spec v0.98 released today (after class)
•Check class web page
Temporal perspective
•Last Thursday:
•Fall break. Hope you had a good one!
•Last Tuesday:
•The true meaning of static
•Singleton design pattern
•Today:
✓Astronomy
✓Administrivia
•Static, singletons, & enums (briefly)
•More RL (algorithms!)
Enums
•Nothing magical about Java enums
•“Under the hood” they’re just classes + the singleton design pattern
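A minimal sketch of what that means (simplified; the real compiler output also extends java.lang.Enum and adds values()/valueOf()):
// A simple enum...
enum Direction { NORTH, SOUTH }

// ...behaves roughly like this hand-written class:
final class DirectionAsClass {
    // Each constant is a shared singleton instance
    public static final DirectionAsClass NORTH = new DirectionAsClass("NORTH");
    public static final DirectionAsClass SOUTH = new DirectionAsClass("SOUTH");

    private final String _name;

    // Private constructor: no outside instantiation -- the singleton pattern
    private DirectionAsClass(String name) { _name = name; }

    public String toString() { return _name; }
}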
Recall: The MDP
•Entire RL environment defined by a Markov decision process:
•M = ⟨S, A, T, R⟩
•S: state space
•A: action space
•T: transition function
•R: reward function
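One purely hypothetical way to picture those four pieces as Java types (a sketch only, not the course framework’s actual interfaces):
import java.util.Set;

// Hypothetical sketch, not the course's real API
interface MDP<S, A> {
    Set<S> states();                      // S: state space
    Set<A> actions();                     // A: action space
    double transition(S s, A a, S sNext); // T: probability of reaching sNext via a from s
    double reward(S s, A a);              // R: immediate reward for taking a in s
}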
Policies
•Plan of action is called a policy, π
•Policy defines what action to take in every state of the system:
•π(s42)=“FWD” means “When I find myself in state s42, execute the action FORWARD”
•This is the basis for learning: tweak π (indirectly) to make the agent better over time
The goal of RL
•Agent’s goal:
•Find the best possible policy: π*
•Find the policy, π*, that maximizes Vπ(s) for all s
•Q: What’s the simplest Java implementation of a policy?
Explicit policy
public class MyAgent implements RLAgent {
    private final Map<State2d, Action> _policy;

    public MyAgent() {
        _policy = new HashMap<State2d, Action>();
    }

    public Action pickAction(State2d here) {
        if (_policy.containsKey(here)) {
            return _policy.get(here);
        }
        // generate a default and add to _policy
        Action fallback = Action.FWD; // assumes Action.FWD exists; any fixed default works
        _policy.put(here, fallback);
        return fallback;
    }
}
Implicit policy
public class MyAgent2 implements RLAgent {
    private final Map<State2d, Map<Action, Double>> _policy;

    public MyAgent2() {
        _policy = new HashMap<State2d, Map<Action, Double>>(); // value type must be Map, not HashMap, to match the field
    }

    public Action pickAction(State2d here) {
        if (_policy.containsKey(here)) {
            Action maxAct = null;
            double v = Double.NEGATIVE_INFINITY; // not Double.MIN_VALUE: Q values can be negative
            for (Action a : _policy.get(here).keySet()) {
                if (_policy.get(here).get(a) > v) {
                    v = _policy.get(here).get(a); // remember the best Q seen so far
                    maxAct = a;
                }
            }
            return maxAct;
        }
        // handle default action case
        return Action.FWD; // assumes Action.FWD exists, as in MyAgent above
    }
}
Q functions
•Implicit policy uses the idea of a “Q function”
•Q : S × A → Reals
•For each action at each state, says how good/bad that action is
•If Q(si,a1)>Q(si,a2), then a1 is a “better” action than a2 at state si
•Represented in code with Map<Action,Double>:
•Mapping from an Action to the value (Q) of that Action
“Where should I go now?”
•Q values at state s29:
•Q(s29,FWD)=2.38
•Q(s29,BACK)=1.79
•Q(s29,TURNCLOCK)=3.49
•Q(s29,TURNCC)=0.74
•Q(s29,NOOP)=2.03
⇒ “Best thing to do is turn clockwise”
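Connecting this to the Map<Action,Double> representation, a small self-contained sketch of the s29 lookup (the Action enum here is local to the demo):
import java.util.HashMap;
import java.util.Map;

public class QLookupDemo {
    // Action names taken from the slide; a standalone enum for this demo
    enum Action { FWD, BACK, TURNCLOCK, TURNCC, NOOP }

    public static void main(String[] args) {
        // Q values for state s29, exactly as on the slide
        Map<Action, Double> qAtS29 = new HashMap<Action, Double>();
        qAtS29.put(Action.FWD, 2.38);
        qAtS29.put(Action.BACK, 1.79);
        qAtS29.put(Action.TURNCLOCK, 3.49);
        qAtS29.put(Action.TURNCC, 0.74);
        qAtS29.put(Action.NOOP, 2.03);

        // Pick the argmax, same loop as in MyAgent2.pickAction()
        Action best = null;
        double bestV = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Action, Double> e : qAtS29.entrySet()) {
            if (e.getValue() > bestV) {
                bestV = e.getValue();
                best = e.getKey();
            }
        }
        System.out.println("Best action at s29: " + best); // prints TURNCLOCK
    }
}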
Q, cont’d
•Now we have something that we can learn!
•For a given state, s, and action, a, adjust Q for that ⟨s,a⟩ pair
•If a seems better than _policy currently has recorded, increase Q(s,a)
•If a seems worse than _policy currently has recorded, decrease Q(s,a)
Q learning in math...
•Let ⟨s,a,r,s′⟩ be an experience tuple
•Let a′ = argmax_g Q(s′,g)
•“Best” action at next state, s′
•Q learning rule says: update current Q with a fraction of the next-state Q value:
•Q(s,a) ← Q(s,a) + α(r + γQ(s′,a′) − Q(s,a))
•0 ≤ α < 1 and 0 ≤ γ < 1 are constants that change the behavior of the algorithm
Q learning in code...
public class MyAgent implements RLAgent {
    public void updateModel(SARSTuple s) {
        State2d start = s.getInitState();
        State2d end = s.getNextState();
        Action act = s.getAction();
        double r = s.getReward();

        double Qnow = _policy.get(start).get(act);
        double Qnext = findMaxQ(_policy.get(end)); // max_g Q(s′,g)
        double Qrevised = Qnow + getAlpha() * (r + getGamma() * Qnext - Qnow);

        _policy.get(start).put(act, Qrevised);
    }

    // Helper: plain Maps have no findMaxQ() method, so compute the max Q at a state
    private double findMaxQ(Map<Action, Double> qs) {
        double best = Double.NEGATIVE_INFINITY;
        for (double q : qs.values()) { best = Math.max(best, q); }
        return best;
    }
}
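To see where updateModel() fits, here is a hypothetical agent-environment loop; the Environment interface and the SARSTuple constructor below are assumptions for illustration, not the course framework’s real API:
// Hypothetical sketch: Environment and the SARSTuple constructor are
// assumptions for illustration, not the course framework's real API
interface Environment {
    State2d getCurrentState();
    double execute(Action a); // perform the action, return the reward
}

public class TrainingLoopSketch {
    public static void run(MyAgent agent, Environment env, int steps) {
        State2d s = env.getCurrentState();
        for (int i = 0; i < steps; i++) {
            Action a = agent.pickAction(s);        // choose action from current policy
            double r = env.execute(a);             // act, observe reward
            State2d sNext = env.getCurrentState(); // observe next state
            agent.updateModel(new SARSTuple(s, a, r, sNext)); // Q-learning update
            s = sNext;                             // continue from the new state
        }
    }
}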