Decision making
Devika Subramanian, Comp 140, Fall 2008
Principle of maximum expected utility
If you are in state s, and you have actions a from a set A, then the best action a* at state s is

a^* = \arg\max_{a \in A} EU(a \mid s),

where EU(a | s) is the expected utility of doing a in s.
An example
Should we have a party indoors or outside?
(Decision tree: choosing in or out, then the weather turning out dry or wet, yields four outcome states s1-s4: in/dry = regret, in/wet = relief, out/dry = perfect, out/wet = disaster.)
Utility function
A numerical score over all possible states of the world.
location   weather   utility
in         dry       50
in         wet       60
out        dry       100
out        wet       0
Maximizing expected utility
(Decision tree: each choice branches on the weather, with P(dry) = 0.7 and P(wet) = 0.3; the leaves are regret (50), relief (60), perfect (100), and disaster (0).)

Choose the action that maximizes expected utility:
EU(in)  = 0.7 * 50  + 0.3 * 60 = 53
EU(out) = 0.7 * 100 + 0.3 * 0  = 70
Choose out.
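As a quick check, here is a minimal Python sketch of this computation; the dictionaries and function names are illustrative (not from the slides), and the utilities and weather probabilities come from the table and tree above.

```python
# Expected-utility computation for the party example (illustrative names).
# Utilities come from the table above; P(dry) = 0.7, P(wet) = 0.3.
utility = {
    ("in", "dry"): 50, ("in", "wet"): 60,
    ("out", "dry"): 100, ("out", "wet"): 0,
}
p_weather = {"dry": 0.7, "wet": 0.3}

def expected_utility(action):
    """Expected utility of an action, averaging over the weather."""
    return sum(p * utility[(action, w)] for w, p in p_weather.items())

print(expected_utility("in"))                    # 53.0
print(expected_utility("out"))                   # 70.0
print(max(["in", "out"], key=expected_utility))  # out
```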
Robot navigation in a grid
(Grid figure: a 4 x 3 grid, columns 1-4 and rows 1-3, with the start state at (1,1), an obstacle at (2,2), and terminal states at (4,3) and (4,2). No action can take you out of a terminal state.)
There are 11 states in the state space.
Stochastic actions
Actions: N, S, E, W.
Effects of actions: each action achieves its intended effect with probability 0.8, but with probability 0.1 each, the action moves the robot at right angles to its intended direction.
Example: the robot is at (1,1) and executes action N. With probability 0.8 it ends up at square (1,2), with probability 0.1 it goes to square (2,1), and with probability 0.1 it remains at (1,1).
Markov state transition model
The next state is probabilistically determined by the current state and the current action, i.e., the transition model is given by P(s_{t+1} | s_t, a_t).

Example:
P((1,2) | (1,1), N) = 0.8
P((1,1) | (1,1), N) = 0.1
P((2,1) | (1,1), N) = 0.1
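A sketch of this transition model in Python, assuming the 4 x 3 grid described earlier, with the obstacle at (2,2) and absorbing terminal states at (4,3) and (4,2); a move into a wall or the obstacle leaves the robot where it is. All names are illustrative.

```python
# Stochastic transition model for the 4 x 3 grid world (illustrative names).
# States are (column, row) pairs; (2, 2) is the obstacle.
STATES = {(c, r) for c in range(1, 5) for r in range(1, 4)} - {(2, 2)}
TERMINALS = {(4, 3), (4, 2)}   # no action can take the robot out of these
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
RIGHT_ANGLES = {"N": "EW", "S": "EW", "E": "NS", "W": "NS"}

def step(s, direction):
    """Intended effect of a move; bumping into a wall or the obstacle stays put."""
    c, r = s
    dc, dr = MOVES[direction]
    nxt = (c + dc, r + dr)
    return nxt if nxt in STATES else s

def transition(s, a):
    """Return {next_state: probability} for executing action a in state s."""
    if s in TERMINALS:
        return {s: 1.0}
    dist = {}
    for d, p in [(a, 0.8)] + [(side, 0.1) for side in RIGHT_ANGLES[a]]:
        nxt = step(s, d)
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

print(transition((1, 1), "N"))   # {(1, 2): 0.8, (2, 1): 0.1, (1, 1): 0.1}
```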
Probabilistic forward projection
Where can you get to from (3,2) with one action?

action   outcomes (probability)
N        (3,3) 0.8    (3,2) 0.1    (4,2) 0.1
S        (3,1) 0.8    (3,2) 0.1    (4,1) 0.1
E        (4,2) 0.8    (3,1) 0.1    (3,3) 0.1
W        (3,2) 0.8    (3,1) 0.1    (3,3) 0.1
Probabilistic forward projection with a plan
Let the start state be (3,2) and let the plan be NE. Can we, with probability 1, get to (4,3) while avoiding the black hole?
(Projection tree: N from (3,2) leads to (3,3) 0.8, (4,2) 0.1, (3,2) 0.1; then E from (3,3) leads to (4,3) 0.8, (3,3) 0.1, (3,2) 0.1, and E from (3,2) leads to (4,2) 0.8, (3,3) 0.1, (3,1) 0.1.)
With probability 0.64 (= 0.8 * 0.8), we reach (4,3) with the fixed plan NE, so the answer is no.
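The 0.64 figure can be checked by pushing the start distribution forward through the plan, reusing the `transition` sketch from the earlier slide; the `project` helper is hypothetical, not from the slides.

```python
def project(start, plan, transition):
    """Push a probability distribution over states through a fixed action sequence."""
    dist = {start: 1.0}
    for a in plan:
        new_dist = {}
        for s, p in dist.items():
            for nxt, q in transition(s, a).items():
                new_dist[nxt] = new_dist.get(nxt, 0.0) + p * q
        dist = new_dist
    return dist

print(project((3, 2), "NE", transition).get((4, 3), 0.0))   # ≈ 0.64 (= 0.8 * 0.8)
```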
Value function
V maps a history of states to a real number. We will consider additive value or utility functions, and define a reward function r mapping states in the state space to a real number, so that

V(s_0, s_1, \ldots, s_n) = \sum_{i=0}^{n} r(s_i)
Reward function: example
r(s) = -0.01 in non-terminal states
r(s) = +1 in terminal state (4,3)
r(s) = -1 in terminal state (4,2)

The agent is penalized for each step; the reward function is defined this way to give the agent an incentive to get out of the grid as quickly as possible via the +1 terminal state.
Markov Decision Process (MDP)
A finite set S of states
A finite set A of actions
State transitions P : S x A → Pr(S)
Rewards r : S → R

Rewards can be functions of the action chosen in a state, e.g. r : S x A → R. The initial state may or may not be specified. In case it isn't, the objective is to find a solution no matter what the initial state is.
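For concreteness, the grid-world MDP can be assembled from the pieces sketched earlier; the reward function follows the example slide above, and the names are illustrative.

```python
# The grid-world MDP assembled from the earlier sketches (illustrative names).
ACTIONS = ["N", "S", "E", "W"]

def reward(s):
    """Reward from the example slide: -0.01 per step, +1 / -1 at the terminals."""
    if s == (4, 3):
        return 1.0
    if s == (4, 2):
        return -1.0
    return -0.01

# An MDP is then the tuple (states, actions, transitions, rewards).
grid_mdp = (STATES, ACTIONS, transition, reward)
```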
Plans and policies
In a Markov environment, if the robot has no sensors, only plans with probabilistic guarantees can be generated.
In a Markov environment that the robot can sense accurately, it can use a solution that specifies what to do for any state that it might reach. Such a mapping from states to actions is called a policy. An optimal policy maximizes the expected utility of the agent in the environment.
Optimal policy
(Figure: the optimal policy for r(s) = -0.01, shown as one action arrow per non-terminal square of the 4 x 3 grid.)
Optimal policy
(Figure: the optimal policy for r(s) = -2, shown on the same grid.)
Value iteration
Basic idea: calculate the expected utility V(s) of each state s in S, and then choose actions that maximize expected utility.
The Maximum Expected Utility Principle
An agent picks the action a in state s that maximizes the expected utility of the subsequent state.
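In symbols, using the same notation as the Bellman equation on the next slide, this is the standard form

\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s, a, s') \, V(s')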
Calculating V(s)
The utility of a state is the immediate reward for that state plus the expected utility of the next state, assuming that the agent chooses the optimal action.
Bellman's equation:

V(s) = r(s) + \max_{a} \sum_{s' \in S} P(s, a, s') \, V(s')
Bellman equation
We get n Bellman equations if there are n states in the state space. Unique solutions exist to this system of n equations. (Bellman, 1957)
Calculating the optimal policy by value iteration
Initialize V_0(s) to be 0, for every s in S.
Loop:
    do a Bellman update for every state s:
        V_{t+1}(s) = r(s) + \max_{a} \sum_{s' \in S} P(s, a, s') \, V_t(s')
    t = t + 1
Until successive values of V are the same.
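A compact Python sketch of this loop, reusing the grid-world pieces from the earlier slides. Two assumptions are added here: a terminal state's value is simply its reward (matching the slide that says no action leads out of a terminal state), and the test "successive values are the same" is replaced by a small tolerance on floating-point values.

```python
def value_iteration(states, actions, transition, reward, terminals, tol=1e-6):
    """Repeated Bellman updates until the values stop changing (within tol)."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            if s in terminals:
                V_new[s] = reward(s)   # assumed convention: terminal value = its reward
                continue
            best = max(sum(p * V[s2] for s2, p in transition(s, a).items())
                       for a in actions)
            V_new[s] = reward(s) + best
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new

def greedy_policy(V, states, actions, transition):
    """In each state, pick the action maximizing expected utility of the next state."""
    return {s: max(actions,
                   key=lambda a: sum(p * V[s2] for s2, p in transition(s, a).items()))
            for s in states}

V_star = value_iteration(STATES, ACTIONS, transition, reward, TERMINALS)
policy = greedy_policy(V_star, STATES, ACTIONS, transition)
```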
Utility function V*
(Converged utilities V*(s) on the 4 x 3 grid; columns 1-4 left to right, rows 1-3 bottom to top, obstacle at (2,2), start state at (1,1).)

row 3:   0.812    0.868      0.918    +1
row 2:   0.762    obstacle   0.660    -1
row 1:   0.705    0.655      0.611    0.388
Termination criteria for value iteration
RMS error: stop when the root-mean-square change in the utility values between successive iterations is small enough.
Policy loss: stop when the policies on successive iterations are the same.
Value iteration converges and computes the optimal policy in time proportional to the square of the number of states times the number of actions.