CS292F StatRL Lecture 2: Markov Decision Processes
Instructor: Yu-Xiang Wang, Spring 2021, UC Santa Barbara
Transcript
Page 1: CS292F StatRL Lecture 2 Markov Decision Processes

CS292F StatRL Lecture 2 Markov Decision Processes

Instructor: Yu-Xiang Wang, Spring 2021

UC Santa Barbara

1

Page 2: CS292F StatRL Lecture 2 Markov Decision Processes

Recap: Markov Decision Processes (MDP) parameterization
• Infinite horizon / discounted setting

2

• Initial state distribution: μ(s)

• Transition kernel: P(s'|s,a)

• Discounting factor: γ ∈ [0,1)

• (Expected) reward function: r(s,a)
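To make the recap concrete, here is a minimal sketch (not from the slides; the names TabularMDP, P, r, gamma, mu are my own) of how this parameterization can be held in code:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class TabularMDP:
    """Infinite-horizon discounted MDP with |S| states and |A| actions."""
    P: np.ndarray      # transition kernel, shape (S, A, S); P[s, a, s'] = P(s'|s,a)
    r: np.ndarray      # expected reward, shape (S, A); r[s, a] = r(s,a)
    gamma: float       # discount factor in [0, 1)
    mu: np.ndarray     # initial state distribution, shape (S,)

    def __post_init__(self):
        # sanity checks: rows of P and mu are probability distributions
        assert np.allclose(self.P.sum(axis=-1), 1.0)
        assert np.isclose(self.mu.sum(), 1.0)
        assert 0.0 <= self.gamma < 1.0
```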

Page 3: CS292F StatRL Lecture 2 Markov Decision Processes

Recap: Reward function and Value functions
• Immediate reward function r(s,a,s') and expected immediate reward
• State value function V^π(s): expected long-term return when starting in s and following π
• State-action value function Q^π(s,a): expected long-term return when starting in s, performing a, and following π

\[ r(s, a, s') = \mathbb{E}[R_1 \mid S_1 = s, A_1 = a, S_2 = s'] \]

\[ r_\pi(s) = \mathbb{E}_{a \sim \pi(a|s)}[R_1 \mid S_1 = s] \]

<latexit sha1_base64="nQg091M9cWDJGCCzLb1YURkaW60=">AAACQnicbVDLSsQwFE19O75GXboJDoIillYE3QiiCC59zahMa0kzmZlg0pbkVhhqv82NX+DOD3DjQhG3LkzHIr4uBM495x5u7gkTwTU4zoM1MDg0PDI6Nl6ZmJyanqnOzjV0nCrK6jQWsToPiWaCR6wOHAQ7TxQjMhTsLLzaK/Sza6Y0j6NT6CXMl6QT8TanBAwVVC8al17Cl/UK3saeJNANw2w/DzJD5k18HLh4FXsdIiUxzbppbNv+oi4zWHNzI0Ap3OAT49jGGvuVoFpzbKdf+C9wS1BDZR0G1XuvFdNUsgioIFo3XScBPyMKOBUsr3ipZgmhV6TDmgZGRDLtZ/0IcrxkmBZux8q8CHCf/e7IiNS6J0MzWVypf2sF+Z/WTKG95Wc8SlJgEf1c1E4FhhgXeeIWV4yC6BlAqOLmr5h2iSIUTOpFCO7vk/+CxrrtGny0UdvZLeMYQwtoES0jF22iHXSADlEdUXSLHtEzerHurCfr1Xr7HB2wSs88+lHW+wfnv6pD</latexit>

<latexit sha1_base64="rv5r6WVnCPbjRTFivYSHhSUMwnE=">AAACTXicbVFdS9xAFJ2sbbWrrWv72JdLF0GpDYkI+iLYlkIftXZV2MRwMzu7Ds4kYeamsMT8wb4U+tZ/0Zc+tIh0sgapHxcGzj3nHmbumbRQ0lIQ/PQ6c48eP5lfeNpdXHr2fLm38uLI5qXhYsBzlZuTFK1QMhMDkqTESWEE6lSJ4/T8Q6MffxXGyjz7QtNCxBonmRxLjuSopDc6OI0KuWY3cB12IdJIZ2lafayTytH1ED4nIbyBaIJao2s2XeP7/g11WtHbsHYCtcIFHDrHLtgNeDcDCHE36fUDP5gV3AdhC/qsrf2k9yMa5bzUIiOu0NphGBQUV2hIciXqblRaUSA/x4kYOpihFjauZmnUsOqYEYxz405GMGP/d1SorZ3q1E0269q7WkM+pA1LGu/ElcyKkkTGry8alwoohyZaGEkjOKmpA8iNdG8FfoYGObkPaEII7658Hxxt+qHDB1v9vfdtHAvsFXvN1ljIttke+8T22YBx9o39Yn/YX++799u79K6uRzte63nJblVn/h992Kw1</latexit>

3
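For example, the expected immediate reward can be computed from a state-action-state reward table by averaging over next states under the transition kernel. A small sketch under the array layout of the previous snippet (r_sas is a hypothetical array of shape (S, A, S), not something defined in the slides):

```python
import numpy as np

def expected_reward(P, r_sas):
    """r(s,a) = E[R | S=s, A=a] = sum_{s'} P(s'|s,a) * r(s,a,s')."""
    return np.einsum("sat,sat->sa", P, r_sas)

def expected_reward_under_policy(P, r_sas, pi):
    """r_pi(s) = sum_a pi(a|s) * r(s,a), with pi of shape (S, A)."""
    r_sa = expected_reward(P, r_sas)
    return np.einsum("sa,sa->s", pi, r_sa)
```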

Page 4: CS292F StatRL Lecture 2 Markov Decision Processes

Recap: Optimal value function and the MDP planning problem

4

Lemma 1.6. We have that:

\[ \big[(1-\gamma)(I-\gamma P^\pi)^{-1}\big]_{(s,a),(s',a')} = (1-\gamma)\sum_{h=0}^{\infty} \gamma^h P^\pi(s_h = s', a_h = a' \mid s_0 = s, a_0 = a) \]

so we can view the (s,a)-th row of this matrix as an induced distribution over states and actions when following π after starting with s_0 = s and a_0 = a.

We leave the proof as an exercise to the reader.

1.1.3 Bellman optimality equations

A remarkable and convenient property of MDPs is that there exists a stationary and deterministic policy that simultaneously maximizes V^π(s) for all s ∈ S. This is formalized in the following theorem:

Theorem 1.7. Let Π be the set of all non-stationary and randomized policies. Define:

\[ V^\star(s) := \sup_{\pi\in\Pi} V^\pi(s), \qquad Q^\star(s,a) := \sup_{\pi\in\Pi} Q^\pi(s,a), \]

which is finite since V^π(s) and Q^π(s,a) are bounded between 0 and 1/(1-γ).

There exists a stationary and deterministic policy π such that for all s ∈ S and a ∈ A,

\[ V^\pi(s) = V^\star(s), \qquad Q^\pi(s,a) = Q^\star(s,a). \]

We refer to such a π as an optimal policy.

Proof: First, let us show that conditioned on (s_0, a_0, r_0, s_1) = (s, a, r, s'), the maximum future discounted value, from time 1 onwards, is not a function of s, a, r. Specifically,

\[ \sup_{\pi\in\Pi} \mathbb{E}\Big[\sum_{t=1}^{\infty} \gamma^t r(s_t,a_t) \,\Big|\, \pi, (s_0,a_0,r_0,s_1)=(s,a,r,s')\Big] = \gamma V^\star(s') \]

For any policy π, define an "offset" policy π_{(s,a,r)}, which is the policy that chooses actions on a trajectory τ according to the same distribution that π chooses actions on the trajectory (s, a, r, τ). For example, π_{(s,a,r)}(a_0 = a' | s_0 = s') is equal to the probability π(a_1 = a' | (s_0,a_0,r_0,s_1) = (s,a,r,s')). By the Markov property, we have that:

\[ \mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t)\,\Big|\,\pi,(s_0,a_0,r_0,s_1)=(s,a,r,s')\Big] = \gamma\,\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\,\Big|\,\pi_{(s,a,r)}, s_0 = s'\Big] = \gamma V^{\pi_{(s,a,r)}}(s'). \]

Hence, due to that V^{π_{(s,a,r)}}(s') is not a function of (s, a, r), we have

\[ \sup_{\pi\in\Pi}\mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t)\,\Big|\,\pi,(s_0,a_0,r_0,s_1)=(s,a,r,s')\Big] = \gamma\cdot \sup_{\pi\in\Pi} V^{\pi_{(s,a,r)}}(s') = \gamma\cdot\sup_{\pi\in\Pi} V^\pi(s') = \gamma V^\star(s'), \]

thus proving the claim.

Goal of MDP planning: given the MDP specification M = (S, A, P, r, γ, μ), compute an optimal policy π⋆ (equivalently, V⋆ or Q⋆).

Approximate solution: find a policy whose value is within ε of V⋆(s) at every state s (an ε-optimal policy).

Page 5: CS292F StatRL Lecture 2 Markov Decision Processes

Recap: General policy, Stationary policy, Deterministic policy
• General policy could depend on the entire history

• Stationary policy

• Stationary, Deterministic policy

5

Page 6: CS292F StatRL Lecture 2 Markov Decision Processes

Recap: We showed the following results about MDPs.
• Proposition: It suffices to consider stationary policies.

1. Occupancy measure

2. There exists a stationary policy with the same occupancy measure

• Corollary: There is a stationary policy that is optimal for all initial states.
• Proof sketch: 1. Construct an optimal non-stationary policy. 2. Apply the above proposition.

6

Page 7: CS292F StatRL Lecture 2 Markov Decision Processes

Bellman equations – the fundamental equations of MDP and RL
• For stationary policies there is an alternative, recursive and more useful way of defining the V-function and Q-function

• Exercise:
  • Prove the Bellman equation from the (first principle) definition.
  • Write down the Bellman equation using the Q function alone.

\[ V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\big[r(s,a,s') + \gamma V^\pi(s')\big] = \sum_a \pi(a|s)\, Q^\pi(s,a) \]

\[ Q^\pi(s,a) = \;? \]

7

Page 8: CS292F StatRL Lecture 2 Markov Decision Processes

Deriving Bellman Equation for stationary policies

8

Page 9: CS292F StatRL Lecture 2 Markov Decision Processes

Bellman equations in matrix forms

• Lemma 1.4 (Bellman consistency): For stationary policies, we have

• In matrix forms:

9

on the quality or the price of the travel package found. In more generic conversational settings, the ultimate reward is whether the conversation was satisfactory to the other agents or humans, or not.

Example 1.3 (Strategic games). This is a popular category of RL applications, where RL has been successful in achieving human level performance in Backgammon, Go, Chess, and various forms of Poker. The usual setting consists of the state being the current game board, actions being the potential next moves and reward being the eventual win/loss outcome or a more detailed score when it is defined in the game. Technically, these are multi-agent RL settings, and, yet, the algorithms used are often non-multi-agent RL algorithms.

1.1.2 Bellman consistency equations for stationary policies

Stationary policies satisfy the following consistency conditions:

Lemma 1.4. Suppose that π is a stationary policy. Then V^π and Q^π satisfy the following Bellman consistency equations: for all s ∈ S, a ∈ A,

\[ V^\pi(s) = Q^\pi(s, \pi(s)), \]
\[ Q^\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\big[ V^\pi(s') \big]. \]

We leave the proof as an exercise to the reader.

It is helpful to view V^π as a vector of length |S| and Q^π and r as vectors of length |S|·|A|. We overload notation and let P also refer to a matrix of size (|S|·|A|) × |S| where the entry P_{(s,a),s'} is equal to P(s'|s,a).

We also will define P^π to be the transition matrix on state-action pairs induced by a stationary policy π, specifically:

\[ P^\pi_{(s,a),(s',a')} := P(s'|s,a)\, \pi(a'|s'). \]

In particular, for deterministic policies we have:

\[ P^\pi_{(s,a),(s',a')} := \begin{cases} P(s'|s,a) & \text{if } a' = \pi(s') \\ 0 & \text{if } a' \neq \pi(s') \end{cases} \]

With this notation, it is straightforward to verify:

\[ Q^\pi = r + \gamma P V^\pi, \]
\[ Q^\pi = r + \gamma P^\pi Q^\pi. \]

Corollary 1.5. We have that:

\[ Q^\pi = (I - \gamma P^\pi)^{-1} r \qquad (0.2) \]

where I is the identity matrix.

Proof: To see that I − γP^π is invertible, observe that for any non-zero vector x ∈ R^{|S||A|},

\[ \begin{aligned}
\|(I - \gamma P^\pi) x\|_\infty &= \|x - \gamma P^\pi x\|_\infty \\
&\ge \|x\|_\infty - \gamma \|P^\pi x\|_\infty && \text{(triangle inequality for norms)} \\
&\ge \|x\|_\infty - \gamma \|x\|_\infty && \text{(each element of } P^\pi x \text{ is an average of } x\text{)} \\
&= (1-\gamma)\|x\|_\infty > 0 && (\gamma < 1,\ x \neq 0)
\end{aligned} \]

which implies I − γP^π is full rank.

The following is also a helpful lemma:


Page 10: CS292F StatRL Lecture 2 Markov Decision Processes

Closed-form solution for solving for value functions

10
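Corollary 1.5 on the previous page gives the closed form Q^π = (I − γP^π)^{-1} r. A sketch of exact policy evaluation based on it, assuming the tabular arrays used earlier (illustrative code, not the lecture's):

```python
import numpy as np

def policy_evaluation_exact(P, r, gamma, pi):
    """Solve Q^pi = (I - gamma * P^pi)^{-1} r for a stationary policy pi of shape (S, A)."""
    S, A = r.shape
    # P as an (S*A) x S matrix: row (s,a) is P(.|s,a)
    P_sa = P.reshape(S * A, S)
    # P^pi on state-action pairs: P^pi[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s')
    P_pi = np.einsum("xt,ta->xta", P_sa, pi).reshape(S * A, S * A)
    # Solve the linear system rather than forming the matrix inverse explicitly.
    Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, r.reshape(S * A))
    V = np.einsum("sa,sa->s", pi, Q.reshape(S, A))   # V^pi(s) = sum_a pi(a|s) Q^pi(s,a)
    return V, Q.reshape(S, A)
```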


Page 11: CS292F StatRL Lecture 2 Markov Decision Processes

Duality between value functions and occupancy measures

11


Page 12: CS292F StatRL Lecture 2 Markov Decision Processes

Invertibility of the matrix


Corollary 1.5 in AJKS: the matrix I − γP^π is full rank / invertible for all γ < 1.

Proof:
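A quick numerical sanity check of this invertibility claim on a randomly generated MDP and policy (an illustrative sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.95

# random transition kernel and random stochastic policy
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
pi = rng.random((S, A)); pi /= pi.sum(axis=-1, keepdims=True)

# build P^pi on state-action pairs, as in the notes
P_pi = np.einsum("sat,tb->satb", P, pi).reshape(S * A, S * A)

M = np.eye(S * A) - gamma * P_pi
print("rank:", np.linalg.matrix_rank(M), "out of", S * A)       # full rank
print("min |eigenvalue|:", np.abs(np.linalg.eigvals(M)).min())  # bounded away from 0
```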

Page 13: CS292F StatRL Lecture 2 Markov Decision Processes

Bellman optimality equations characterize the optimal policy

• System of n non-linear equations (one per state)
• Solve for V*(s)
• Easy to extract the optimal policy
• Having Q*(s,a) makes it even simpler

\[ V^\star(s) = \max_a \sum_{s'} P(s'|s,a)\big[ r(s,a,s') + \gamma V^\star(s') \big] \]

13

Page 14: CS292F StatRL Lecture 2 Markov Decision Processes

Proposition: There is a deterministic, stationary and optimal policy.
• And it is given by: π⋆(s) = argmax_{a∈A} Q⋆(s,a)

• Proof:

14

Page 15: CS292F StatRL Lecture 2 Markov Decision Processes

The crux of solving the MDP planning problem is to construct Q*
• In the remainder of this lecture, we will talk about two approaches

1. By solving a Linear Program

2. By solving Bellman equations / Bellman optimality equations.

15

Page 16: CS292F StatRL Lecture 2 Markov Decision Processes

The linear programming approach

• Solve for V* by solving the following LP

16

Iteration complexity for an exact solution. With regards to computing an exact optimal policy, it is clear from the previous results that policy iteration is no worse than value iteration. However, with regards to obtaining an exact solution to the MDP that is independent of the bit complexity, L(P,r,γ), improvements are possible (and where we assume basic arithmetic operations on real numbers are order one cost). Naively, the number of iterations of policy iteration is bounded by the number of policies, namely |A|^{|S|}; here, a small improvement is possible, where the number of iterations of policy iteration can be bounded by |A|^{|S|}/|S|. Remarkably, for a fixed value of γ, policy iteration can be shown to be a strongly polynomial time algorithm, where policy iteration finds an exact policy in at most |S|^2 |A| log(|S|^2/(1−γ)) / (1−γ) iterations. See Table 0.1 for a summary, and Section 1.7 for references.

1.5 The Linear Programming Approach

It is helpful to understand an alternative approach to finding an optimal policy for a known MDP. With regards to computation, consider the setting where our MDP M = (S, A, P, r, γ, μ) is known and P, r, and γ are all specified by rational numbers. Here, from a computational perspective, the previous iterative algorithms are, strictly speaking, not polynomial time algorithms, due to that they depend polynomially on 1/(1−γ), which is not polynomial in the description length of the MDP. In particular, note that any rational value of 1−γ may be specified with only O(log(1/(1−γ))) bits of precision. In this context, we may hope for a fully polynomial time algorithm, when given knowledge of the MDP, which would have a computation time which would depend polynomially on the description length of the MDP M, when the parameters are specified as rational numbers. We now see that the LP approach provides a polynomial time algorithm.

1.5.1 The Primal LP and A Polynomial Time Algorithm

Consider the following optimization problem with variables V ∈ R^{|S|}:

\[ \min \sum_s \mu(s) V(s) \quad \text{subject to} \quad V(s) \ge r(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \quad \forall a \in A,\ s \in S \]

Here, the optimal value function V⋆(s) is the unique solution to this linear program. With regards to computation time, linear programming approaches only depend on the description length of the coefficients in the program, due to that this determines the computational complexity of basic additions and multiplications. Thus, this approach will only depend on the bit length description of the MDP, when the MDP is specified by rational numbers.

Computational complexity for an exact solution. Table 0.1 shows the runtime complexity for the LP approach, where we assume a standard runtime for solving a linear program. The strongly polynomial algorithm is an interior point algorithm. See Section 1.7 for references.

Policy iteration and the simplex algorithm. It turns out that the policy iteration algorithm is actually the simplex method with block pivot. While the simplex method, in general, is not a strongly polynomial time algorithm, the policy iteration algorithm is a strongly polynomial time algorithm, provided we keep the discount factor fixed. See [Ye, 2011].
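A sketch of this primal LP with scipy.optimize.linprog, assuming the tabular arrays P, r, gamma, mu from earlier (illustrative only; the constraint V(s) ≥ r(s,a) + γ Σ_{s'} P(s'|s,a)V(s') is rewritten in the ≤ form that linprog expects):

```python
import numpy as np
from scipy.optimize import linprog

def solve_primal_lp(P, r, gamma, mu):
    """Return V* by solving: min mu^T V  s.t.  V(s) >= r(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
    S, A = r.shape
    # One inequality row per (s, a):  gamma * P(.|s,a)^T V - V(s) <= -r(s,a)
    A_ub = gamma * P.reshape(S * A, S)
    A_ub -= np.repeat(np.eye(S), A, axis=0)       # subtract e_s from every row indexed by (s, a)
    b_ub = -r.reshape(S * A)
    res = linprog(c=mu, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    assert res.success
    return res.x                                   # V*(s) for each state s
```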

Page 17: CS292F StatRL Lecture 2 Markov Decision Processes

The linear programming approach

• Solve for V* by solving the following LP

16


Quiz 1: Once we have V*, how to construct Q*?
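One answer to Quiz 1, as a sketch under the same array conventions as earlier: plug V* into a one-step lookahead, then act greedily.

```python
import numpy as np

def q_from_v(P, r, gamma, V):
    """Q*(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) V*(s')."""
    return r + gamma * np.einsum("sat,t->sa", P, V)

def greedy_policy(Q):
    """Deterministic greedy policy pi(s) = argmax_a Q(s,a) (ties broken by argmax)."""
    return Q.argmax(axis=1)
```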

Page 18: CS292F StatRL Lecture 2 Markov Decision Processes

The Lagrange dual of the LP

17

• Exercise: Derive the dual by applying the standard procedure.

Page 19: CS292F StatRL Lecture 2 Markov Decision Processes

The Lagrange dual of the LP

17

• Exercise: Derive the dual by applying the standard procedure.

Quiz 2: Once we have the solution, how to construct the policy?
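For reference, a sketch of the dual obtained by the standard Lagrangian procedure applied to the primal above (this spells out the exercise; the occupancy-measure variables lam(s,a) and the normalization convention come from that derivation, not from the slides). Quiz 2's policy is then recovered by normalizing lam over actions:

```python
import numpy as np
from scipy.optimize import linprog

def solve_dual_lp(P, r, gamma, mu):
    """Dual of the primal LP above (via the standard Lagrangian; up to normalization conventions):
       max sum_{s,a} lam(s,a) r(s,a)
       s.t. sum_a lam(s,a) = mu(s) + gamma * sum_{s',a'} P(s|s',a') lam(s',a'),  lam >= 0.
    """
    S, A = r.shape
    # Equality constraint per state s: sum_a lam(s,a) - gamma * sum_{s',a'} P(s|s',a') lam(s',a') = mu(s)
    A_eq = np.repeat(np.eye(S), A, axis=0).T - gamma * P.reshape(S * A, S).T
    res = linprog(c=-r.reshape(S * A), A_eq=A_eq, b_eq=mu,
                  bounds=[(0, None)] * (S * A))
    assert res.success
    lam = res.x.reshape(S, A)                       # discounted state-action occupancy measure
    occ = lam.sum(axis=1, keepdims=True)
    # pi(a|s) proportional to lam(s,a); states with zero occupancy get a uniform placeholder
    pi = np.where(occ > 0, lam / np.maximum(occ, 1e-12), 1.0 / A)
    return lam, pi
```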

Page 20: CS292F StatRL Lecture 2 Markov Decision Processes

Value iteration for MDP planning

• Recall: Bellman optimality equations

18

\[ V^\star(s) = \max_a \sum_{s'} P(s'|s,a)\big[ r(s,a,s') + \gamma V^\star(s') \big] \]

We now show the deterministic and stationary policy π(s) = argmax_{a∈A} sup_{π'∈Π} Q^{π'}(s,a) satisfies V^π(s) = sup_{π'∈Π} V^{π'}(s). For this, we have that:

\[ \begin{aligned}
V^\star(s_0) &= \sup_{\pi\in\Pi} \mathbb{E}\Big[r(s_0,a_0) + \sum_{t=1}^{\infty}\gamma^t r(s_t,a_t)\Big] \\
&= \sup_{\pi\in\Pi} \mathbb{E}\Big[r(s_0,a_0) + \mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t)\,\Big|\,\pi,(s_0,a_0,r_0,s_1)\Big]\Big] \\
&\le \sup_{\pi\in\Pi} \mathbb{E}\Big[r(s_0,a_0) + \sup_{\pi'\in\Pi}\mathbb{E}\Big[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t)\,\Big|\,\pi',(s_0,a_0,r_0,s_1)\Big]\Big] \\
&= \sup_{\pi\in\Pi} \mathbb{E}\big[r(s_0,a_0) + \gamma V^\star(s_1)\big] \\
&= \mathbb{E}\big[r(s_0,a_0) + \gamma V^\star(s_1)\,\big|\,\pi\big].
\end{aligned} \]

where the second equality is by the tower property of conditional expectations, and the last equality follows from the definition of π. Now, by recursion,

\[ V^\star(s_0) \le \mathbb{E}\big[r(s_0,a_0)+\gamma V^\star(s_1)\,\big|\,\pi\big] \le \mathbb{E}\big[r(s_0,a_0)+\gamma r(s_1,a_1)+\gamma^2 V^\star(s_2)\,\big|\,\pi\big] \le \dots \le V^\pi(s_0). \]

Since V^π(s) ≤ sup_{π'∈Π} V^{π'}(s) = V⋆(s), we have that V^π = V⋆, which completes the proof of the first claim.

For the same policy π, an analogous argument can be used to prove the second claim.

This shows that we may restrict ourselves to using stationary and deterministic policies without any loss in performance. The following theorem, also due to [Bellman, 1956], gives a precise characterization of the optimal value function.

Let us say that a vector Q ∈ R^{|S||A|} satisfies the Bellman optimality equations if:

\[ Q(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\max_{a'\in A} Q(s',a')\Big]. \]

Theorem 1.8 (Bellman Optimality Equations). For any Q ∈ R^{|S||A|}, we have that Q = Q⋆ if and only if Q satisfies the Bellman optimality equations. Furthermore, the deterministic policy π(s) ∈ argmax_{a∈A} Q⋆(s,a) is an optimal policy (where ties are broken in some arbitrary and deterministic manner).

Before we prove this claim, we will provide a few definitions. Let π_Q denote the greedy policy with respect to a vector Q ∈ R^{|S||A|}, i.e.

\[ \pi_Q(s) := \operatorname*{argmax}_{a\in A} Q(s,a), \]

where ties are broken in some arbitrary (and deterministic) manner. With this notation, by the above theorem, the optimal policy π⋆ is given by:

\[ \pi^\star = \pi_{Q^\star}. \]

Let us also use the following notation to turn a vector Q ∈ R^{|S||A|} into a vector of length |S|:

\[ V_Q(s) := \max_{a\in A} Q(s,a). \]

The Bellman optimality operator T_M : R^{|S||A|} → R^{|S||A|} is defined as:

\[ \mathcal{T} Q := r + \gamma P V_Q. \qquad (0.3) \]

Theorem 1.8 (AJKS): Q = Q* if and only if Q satisfies the Bellman optimality equations.


where V_Q(s) := max_{a∈A} Q(s,a), as defined above.

Page 21: CS292F StatRL Lecture 2 Markov Decision Processes

Value iteration for MDP planning

• The value iteration algorithm iteratively applies the Bellman operator until it converges.

1. Initialize Q0 arbitrarily

2. For i in 1, 2, 3, …, k, update Q_i ← T Q_{i−1}

3. Return Qk

19

Page 22: CS292F StatRL Lecture 2 Markov Decision Processes

Value iteration for MDP planning

• The value iteration algorithm iteratively applies the Bellman operator until it converges.

1. Initialize Q0 arbitrarily

2. For i in 1, 2, 3, …, k, update Q_i ← T Q_{i−1}

3. Return Qk

• What is the right question to ask here?

19
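A sketch of Q-value iteration as listed above, under the same array conventions as the earlier snippets (not the lecture's own code):

```python
import numpy as np

def bellman_optimality_operator(Q, P, r, gamma):
    """(T Q)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')."""
    V_Q = Q.max(axis=1)                             # V_Q(s) = max_a Q(s,a)
    return r + gamma * np.einsum("sat,t->sa", P, V_Q)

def value_iteration(P, r, gamma, k):
    """Run k applications of T starting from Q_0 = 0 and return Q_k."""
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(k):
        Q = bellman_optimality_operator(Q, P, r, gamma)
    return Q
```

The returned policy is the greedy one, π_{Q_k}(s) = argmax_a Q_k(s,a); the next slides quantify how large k must be.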

Page 23: CS292F StatRL Lecture 2 Markov Decision Processes

Convergence analysis of VI

• Lemma 1. The Bellman operator is a γ-contraction.

20

Table 0.1: Computational complexities of various approaches (we drop universal constants).

• Value Iteration: polynomial runtime |S|^2 |A| L(P,r,γ) log(1/(1−γ)) / (1−γ); not strongly polynomial.
• Policy Iteration: polynomial runtime (|S|^3 + |S|^2|A|) L(P,r,γ) log(1/(1−γ)) / (1−γ); strongly polynomial runtime (|S|^3 + |S|^2|A|) · min{ |A|^{|S|}/|S|, |S|^2|A| log(|S|^2/(1−γ)) / (1−γ) }.
• LP-Algorithms: polynomial runtime |S|^3 |A| L(P,r,γ); strongly polynomial runtime |S|^4 |A|^4 log(|S|/(1−γ)).

Polynomial time algorithms depend on the bit complexity, L(P,r,γ), while strongly polynomial algorithms do not. Note that only for a fixed value of γ are value and policy iteration polynomial time algorithms; otherwise, they are not polynomial time algorithms. Similarly, only for a fixed value of γ is policy iteration a strongly polynomial time algorithm. In contrast, the LP-approach leads to both polynomial time and strongly polynomial time algorithms; for the latter, the approach is an interior point algorithm. See text for further discussion, and Section 1.7 for references. Here, |S|^2|A| is the assumed runtime per iteration of value iteration, and |S|^3 + |S|^2|A| is the assumed runtime per iteration of policy iteration (note that for this complexity we would directly update the values V rather than Q values, as described in the text); these runtimes are consistent with assuming cubic complexity for linear system solving.

Suppose that (P, r, γ) in our MDP M is specified with rational entries. Let L(P, r, γ) denote the total bit-size required to specify M, and assume that basic arithmetic operations +, −, ×, ÷ take unit time. Here, we may hope for an algorithm which (exactly) returns an optimal policy whose runtime is polynomial in L(P, r, γ) and the number of states and actions.

More generally, it may also be helpful to understand which algorithms are strongly polynomial. Here, we do not want to explicitly restrict (P, r, γ) to be specified by rationals. An algorithm is said to be strongly polynomial if it returns an optimal policy with runtime that is polynomial in only the number of states and actions (with no dependence on L(P, r, γ)).

1.4 Iterative Methods

Planning refers to the problem of computing π⋆_M given the MDP specification M = (S, A, P, r, γ). This section reviews classical planning algorithms that compute Q⋆.

1.4.1 Value Iteration

A simple algorithm is to iteratively apply the fixed point mapping: starting at some Q, we iteratively apply T:

\[ Q \leftarrow \mathcal{T} Q. \]

This algorithm is referred to as Q-value iteration.

Lemma 1.10 (contraction). For any two vectors Q, Q' ∈ R^{|S||A|},

\[ \|\mathcal{T} Q - \mathcal{T} Q'\|_\infty \le \gamma \|Q - Q'\|_\infty. \]

Proof: First, let us show that for all s ∈ S, |V_Q(s) − V_{Q'}(s)| ≤ max_{a∈A} |Q(s,a) − Q'(s,a)|. Assume V_Q(s) > V_{Q'}(s) (the other direction is symmetric), and let a be the greedy action for Q at s. Then

\[ |V_Q(s) - V_{Q'}(s)| = Q(s,a) - \max_{a'\in A} Q'(s,a') \le Q(s,a) - Q'(s,a) \le \max_{a\in A} |Q(s,a) - Q'(s,a)|. \]
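A quick empirical check of this γ-contraction on a random MDP (a self-contained sketch; the operator is re-implemented inline, so no earlier snippet is required):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 6, 4, 0.9

P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
r = rng.random((S, A))

def T(Q):
    # Bellman optimality operator: (T Q)(s,a) = r(s,a) + gamma * E_{s'}[max_a' Q(s',a')]
    return r + gamma * np.einsum("sat,t->sa", P, Q.max(axis=1))

for _ in range(5):
    Q1, Q2 = rng.normal(size=(S, A)), rng.normal(size=(S, A))
    lhs = np.abs(T(Q1) - T(Q2)).max()               # ||T Q1 - T Q2||_inf
    rhs = gamma * np.abs(Q1 - Q2).max()             # gamma * ||Q1 - Q2||_inf
    assert lhs <= rhs + 1e-12
```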


Page 24: CS292F StatRL Lecture 2 Markov Decision Processes

Convergence analysis of VI

• Lemma 2. Convergence of the Q function.

21

Quiz 3: Computing “Iteration complexity” from “convergence bound”?

Page 25: CS292F StatRL Lecture 2 Markov Decision Processes

Convergence of the Q function implies the convergence of the value of the induced policy. Lemma 1.11 in AJKS (Q-error amplification):

22

Using this,

kT Q� T Q0k1 = �kPVQ � PVQ0k1= �kP (VQ � VQ0)k1 �kVQ � VQ0k1= �max

s

|VQ(s)� VQ0(s)|

�maxs

maxa

|Q(s, a)�Q0(s, a)|

= �kQ�Q0k1

where the first inequality uses that each element of P (VQ � VQ0) is a convex average of VQ � VQ0 and the secondinequality uses our claim above.

The following result bounds the sub-optimality of the greedy policy itself, based on the error in Q-value function.

Lemma 1.11. (Q-Error Amplification) For any vector Q 2 R|S||A|,

V ⇡Q � V ? � 2kQ�Q?k11� �

1.

where 1 denotes the vector of all ones.

Proof: Fix state s and let a = ⇡Q(s). We have:

\begin{align*}
V^\star(s) - V^{\pi_Q}(s) &= Q^\star(s, \pi^\star(s)) - Q^{\pi_Q}(s, a) \\
&= Q^\star(s, \pi^\star(s)) - Q^\star(s, a) + Q^\star(s, a) - Q^{\pi_Q}(s, a) \\
&= Q^\star(s, \pi^\star(s)) - Q^\star(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V^\star(s') - V^{\pi_Q}(s')\big] \\
&\le Q^\star(s, \pi^\star(s)) - Q(s, \pi^\star(s)) + Q(s, a) - Q^\star(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V^\star(s') - V^{\pi_Q}(s')\big] \\
&\le 2\|Q - Q^\star\|_\infty + \gamma \|V^\star - V^{\pi_Q}\|_\infty,
\end{align*}

where the first inequality uses $Q(s, \pi^\star(s)) \le Q(s, \pi_Q(s)) = Q(s, a)$ due to the definition of $\pi_Q$. Taking the maximum over $s$ gives $\|V^\star - V^{\pi_Q}\|_\infty \le 2\|Q - Q^\star\|_\infty + \gamma \|V^\star - V^{\pi_Q}\|_\infty$, and rearranging yields $\|V^\star - V^{\pi_Q}\|_\infty \le \frac{2\|Q - Q^\star\|_\infty}{1-\gamma}$, which is the claim.

Theorem 1.12. (Q-value iteration convergence) Set $Q^{(0)} = 0$. For $k = 0, 1, \ldots$, suppose:

$$Q^{(k+1)} = \mathcal{T} Q^{(k)}.$$

Let $\pi^{(k)} = \pi_{Q^{(k)}}$. For $k \ge \frac{\log\frac{2}{(1-\gamma)^2\epsilon}}{1-\gamma}$,

$$V^{\pi^{(k)}} \ge V^\star - \epsilon \mathbf{1}.$$

Proof: Since $\|Q^\star\|_\infty \le 1/(1-\gamma)$, $Q^{(k)} = \mathcal{T}^k Q^{(0)}$, and $Q^\star = \mathcal{T} Q^\star$, Lemma 1.10 gives

$$\|Q^{(k)} - Q^\star\|_\infty = \|\mathcal{T}^k Q^{(0)} - \mathcal{T}^k Q^\star\|_\infty \le \gamma^k \|Q^{(0)} - Q^\star\|_\infty = (1 - (1-\gamma))^k \|Q^\star\|_\infty \le \frac{\exp(-(1-\gamma)k)}{1-\gamma}.$$

The proof is completed with our choice of $k$ and using Lemma 1.11.
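Spelling out that last step also answers Quiz 3 above on turning a convergence bound into an iteration complexity: by Lemma 1.11, it suffices to make $\frac{2\|Q^{(k)} - Q^\star\|_\infty}{1-\gamma} \le \epsilon$, and we solve the displayed bound for $k$:

$$\frac{2\exp(-(1-\gamma)k)}{(1-\gamma)^2} \le \epsilon \;\Longleftrightarrow\; \exp(-(1-\gamma)k) \le \frac{(1-\gamma)^2\epsilon}{2} \;\Longleftrightarrow\; k \ge \frac{\log\frac{2}{(1-\gamma)^2\epsilon}}{1-\gamma}.$$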

Iteration complexity for an exact solution. With regard to computing an exact optimal policy, when the gap between the current objective value and the optimal objective value is smaller than $2^{-L(P,r,\gamma)}$, the greedy policy will be optimal. This leads to the complexity claimed in Table 0.1. Value iteration is not a strongly polynomial algorithm because, in finite time, it may never return the optimal policy.
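As a rough accounting (informal, dropping constants and low-order terms): with the per-iteration cost of $|S|^2|A|$ assumed in Table 0.1, reaching accuracy $\epsilon$ costs on the order of

$$|S|^2|A| \cdot \frac{\log\frac{2}{(1-\gamma)^2\epsilon}}{1-\gamma}$$

operations, and taking $\epsilon$ on the order of $2^{-L(P,r,\gamma)}$ recovers, up to constants, the value iteration entry in Table 0.1.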


An alternative method: policy iteration

1.4.2 Policy Iteration

The policy iteration algorithm starts from an arbitrary policy $\pi_0$ and repeats the following iterative procedure: for $k = 0, 1, 2, \ldots$

1. Policy evaluation. Compute $Q^{\pi_k}$.

2. Policy improvement. Update the policy:

$$\pi_{k+1} = \pi_{Q^{\pi_k}}.$$

In each iteration, we compute the Q-value function of $\pi_k$, using the analytical form given in Equation 0.2, and update the policy to be greedy with respect to this new Q-value. The first step is often called policy evaluation, and the second step is often called policy improvement.
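The following is a minimal sketch of policy iteration in the same tabular setup assumed for the value-iteration sketch above (the array shapes and names are again illustrative assumptions); the evaluation step solves the linear system for $V^{\pi_k}$, which is equivalent to the closed form referenced as Equation 0.2.

```python
import numpy as np

def policy_iteration(P, r, gamma, num_iters=100):
    """Tabular policy iteration with exact policy evaluation.

    P: transition probabilities, shape (S, A, S); r: expected rewards, shape (S, A).
    """
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                    # arbitrary initial deterministic policy
    Q = np.zeros((S, A))
    for _ in range(num_iters):
        # Policy evaluation: V^pi solves (I - gamma * P_pi) V = r_pi.
        P_pi = P[np.arange(S), pi]                 # shape (S, S)
        r_pi = r[np.arange(S), pi]                 # shape (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to Q^{pi_k}.
        Q = r + gamma * P @ V
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):             # greedy policy unchanged; with exact evaluation it is optimal
            break
        pi = new_pi
    return pi, Q
```

Each iteration costs roughly $|S|^3 + |S|^2|A|$ operations (the linear solve plus the greedy update), which matches the per-iteration cost assumed for policy iteration in Table 0.1.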

Lemma 1.13. We have that:

1. $Q^{\pi_{k+1}} \ge \mathcal{T} Q^{\pi_k} \ge Q^{\pi_k}$

2. $\|Q^{\pi_{k+1}} - Q^\star\|_\infty \le \gamma \|Q^{\pi_k} - Q^\star\|_\infty$

Proof: First let us show that $\mathcal{T} Q^{\pi_k} \ge Q^{\pi_k}$. Note that the policies produced in policy iteration are always deterministic, so $V^{\pi_k}(s) = Q^{\pi_k}(s, \pi_k(s))$ for all iterations $k$ and states $s$. Hence,

\begin{align*}
\mathcal{T} Q^{\pi_k}(s, a) &= r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\max_{a'} Q^{\pi_k}(s', a')\big] \\
&\ge r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[Q^{\pi_k}(s', \pi_k(s'))\big] = Q^{\pi_k}(s, a).
\end{align*}

Now let us prove that $Q^{\pi_{k+1}} \ge \mathcal{T} Q^{\pi_k}$. First, let us see that $Q^{\pi_{k+1}} \ge Q^{\pi_k}$:

$$Q^{\pi_k} = r + \gamma P^{\pi_k} Q^{\pi_k} \le r + \gamma P^{\pi_{k+1}} Q^{\pi_k} \le \sum_{t=0}^{\infty} \gamma^t (P^{\pi_{k+1}})^t r = Q^{\pi_{k+1}},$$

where we have used that $\pi_{k+1}$ is the greedy policy in the first inequality and recursion in the second inequality. Using this,

\begin{align*}
Q^{\pi_{k+1}}(s, a) &= r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[Q^{\pi_{k+1}}(s', \pi_{k+1}(s'))\big] \\
&\ge r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[Q^{\pi_k}(s', \pi_{k+1}(s'))\big] \\
&= r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\max_{a'} Q^{\pi_k}(s', a')\big] = \mathcal{T} Q^{\pi_k}(s, a),
\end{align*}

which completes the proof of the first claim.

For the second claim,

$$\|Q^\star - Q^{\pi_{k+1}}\|_\infty \le \|Q^\star - \mathcal{T} Q^{\pi_k}\|_\infty = \|\mathcal{T} Q^\star - \mathcal{T} Q^{\pi_k}\|_\infty \le \gamma \|Q^\star - Q^{\pi_k}\|_\infty,$$

where the first step uses $Q^\star \ge Q^{\pi_{k+1}} \ge \mathcal{T} Q^{\pi_k}$ (so that $0 \le Q^\star - Q^{\pi_{k+1}} \le Q^\star - \mathcal{T} Q^{\pi_k}$), the second step uses $Q^\star = \mathcal{T} Q^\star$, and the last step uses the contraction property of $\mathcal{T}(\cdot)$ (Lemma 1.10).

With this lemma, a convergence rate for the policy iteration algorithm immediately follows.

Theorem 1.14. (Policy iteration convergence) Let $\pi_0$ be any initial policy. For $k \ge \frac{\log\frac{1}{(1-\gamma)\epsilon}}{1-\gamma}$, the $k$-th policy in policy iteration has the following performance bound:

$$Q^{\pi_k} \ge Q^\star - \epsilon \mathbf{1}.$$



Computational complexity of these MDP solvers

• VI: $|S|^2|A|\, L(P,r,\gamma)\, \frac{\log\frac{1}{1-\gamma}}{1-\gamma}$ (polynomial for fixed $\gamma$, but not strongly polynomial; see Table 0.1)

• PI: $(|S|^3 + |S|^2|A|)\, L(P,r,\gamma)\, \frac{\log\frac{1}{1-\gamma}}{1-\gamma}$, and strongly polynomial for fixed $\gamma$

• LP: $|S|^3|A|\, L(P,r,\gamma)$, with a strongly polynomial interior-point variant


Strongly polynomial algorithms are independent of ε

Table 0.1 (computational complexities of the three approaches; see the full caption above):

• Value Iteration: polynomial bound $|S|^2|A|\, L(P,r,\gamma)\, \frac{\log\frac{1}{1-\gamma}}{1-\gamma}$; strongly polynomial: ✗

• Policy Iteration: polynomial bound $(|S|^3 + |S|^2|A|)\, L(P,r,\gamma)\, \frac{\log\frac{1}{1-\gamma}}{1-\gamma}$; strongly polynomial bound $(|S|^3 + |S|^2|A|) \cdot \min\left\{ \frac{|A|^{|S|}}{|S|},\; \frac{|S|^2|A| \log\frac{|S|^2}{1-\gamma}}{1-\gamma} \right\}$

• LP-Algorithms: polynomial bound $|S|^3|A|\, L(P,r,\gamma)$; strongly polynomial bound $|S|^4|A|^4 \log\frac{|S|}{1-\gamma}$


Next lecture

• Approximate / randomized solvers for MDP

• MDP / RL with generative models
