Curiosity, unobserved rewards, and neural networks in RL function approximation
Csaba Szepesvári (DeepMind & University of Alberta)
New Directions in RL and Control, Princeton, 2019
Part I: Curiosity
“One of the striking differences between current reinforcement learning algorithms and early human learning is that animals and infants appear to explore their environments with autonomous purpose, in a manner appropriate to their current level of skills.”
(ALT 2011, invited talk by Peter Auer)
Modeling Curiosity (ALW model)
• Controlled process
• Stochasticity: makes things more interesting/realistic
• Countably many states; they are observed
  – Simplifying assumption
  – Hope: some of the principles/algorithms transfer to the general case
  – You have to start somewhere
• Reset to an initial state
  – Necessary
  – Engineer the environment to make this happen (robot moms!)
• Goal: Extend the set of reliably reachable states as quickly as possible
Performance metric
• # Reliably reachable states / time
• Fix an arbitrary partial order $\prec$ on states
  – Not known to the learner.
• Fix $L > 0$. Define $\mathcal{S}_L^{\prec}$ as follows:
  – $s_0 \in \mathcal{S}_L^{\prec}$
  – $s \in \mathcal{S}_L^{\prec}$ if $\exists \pi$ on $\{s' \prec s : s' \in \mathcal{S}_L^{\prec}\}$ s.t. $\tau(s \mid \pi) \le L$, where $\tau(s \mid \pi)$ denotes the expected time for $\pi$ to reach $s$ from $s_0$
• Define: $\mathcal{S}_L^{\to} = \cup_{\prec} \mathcal{S}_L^{\prec}$.
• Note: Simpler definitions don't work (counterexamples).
• Prop: $\exists \prec$ s.t. $\mathcal{S}_L^{\to} = \mathcal{S}_L^{\prec}$ and $\mathcal{S}_L^{\to}$ is finite.
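To make the definition concrete, here is a minimal tabular sketch of computing the reliably reachable set when the transition kernel is known (unlike the learner's situation, and not Lim and Auer's algorithm). The shape convention `P[a, s, s']`, the function names, and the use of `2L + 1` as a stand-in for infinity are illustrative assumptions.

```python
import numpy as np

def min_hitting_time(P, known, target, L, iters=500):
    """Value iteration for the smallest expected time to reach `target`
    using policies defined on the already-known states; stepping outside
    known ∪ {target} is treated as failure (a large constant time)."""
    n_actions, n_states, _ = P.shape
    inside = set(known) | {target}
    big = 2.0 * L + 1.0                     # finite stand-in for +infinity
    h = np.full(n_states, big)
    h[target] = 0.0
    for _ in range(iters):
        for s in known:
            if s == target:
                continue
            q_best = big
            for a in range(n_actions):
                exp_next = sum(p * (h[y] if y in inside else big)
                               for y, p in enumerate(P[a, s]) if p > 0)
                q_best = min(q_best, 1.0 + exp_next)
            h[s] = min(q_best, big)
    return h

def reliably_reachable(P, s0, L):
    """Grow S_L as in the definition: a state joins once some policy over
    the current set reaches it from s0 within expected time L."""
    S = {s0}
    changed = True
    while changed:
        changed = False
        for s in range(P.shape[1]):
            if s not in S and min_hitting_time(P, S, s, L)[s0] <= L:
                S.add(s)
                changed = True
    return S
```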
UcbExplore
Known states
1. Discover
2. Propose
3. Verify
(Lim and Auer, COLT 2012)
Main result
Anytime, continual-learning version:
(Lim and Auer, COLT 2012)
Nonstationarity
(w. Auer, Gajane, Ortner, 2019)
Performance metric
• $F$: number of times the transition probabilities change
  – ($t = 1$: always a change)
• $W(L)$ time steps to find all $L$-reachable states in a single MDP ⇒ $F \cdot W(L)$ time steps when there are $F$ changes
• Classification of time steps: the algorithm has correct knowledge of what is reachable, or it does not; the algorithm is competent vs. incompetent
• Goal: Minimize the number of time steps when the algorithm is incompetent
• Difficulty: The location and the number of changes are unknown
Main ideas
• Two phases:
  – Build the set of reachable states $\mathcal{K}$ (UcbExplore)
  – Repeat: check for new reachable states (UcbExplore) or disappearing states (as in the verification phase of UcbExplore)
  – Break out when UcbExplore often takes too long compared to its predicted runtime
• Checking starts when building is done
• Issues with building:
  – How can the algorithm know whether a change happened while building? E.g., a new state was found reachable: before the change, or after the change?
• Solution: Staggered start of many parallel building processes; quit building when any of the processes finishes (see the control-flow sketch below).
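A control-flow sketch of the staggered-start trick; the builder interface (`make_builder`, `step`, `finished`, `result`) and the fixed stagger period are hypothetical placeholders, not the paper's actual scheduling:

```python
def staggered_build(make_builder, stagger):
    """Launch a fresh copy of the building process every `stagger` steps,
    advance all copies in lockstep, and stop as soon as ANY copy finishes.
    A change that corrupts early copies then cannot stall building forever:
    some copy started after the change runs on unchanged dynamics."""
    builders, t = [], 0
    while True:
        if t % stagger == 0:
            builders.append(make_builder())  # staggered fresh start
        for b in builders:
            b.step()                         # one exploration step
            if b.finished():
                return b.result()            # the first finisher wins
        t += 1
```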
(w. Auer, Gajane, Ortner, 2019)
Result
• Theorem: Up to lower-order terms and log factors, the total number of steps when the algorithm is incompetent is at most $W(L)F^2$, irrespective of when the changes happen.
• Questions:
  – Is the $W(L)$ cost necessary without changes?
  – Is the quadratic dependence above necessary?
  – Nontabular?
(w. Auer, Gajane, Ortner, 2019)
Part II: Unobserved rewards
• RL: rewards are always observed
  – internally computed
  – externally provided
• Is this reasonable? Is the environment state observable?
• What happens when rewards are not observable?
• Consequences for:
  – Planning
  – Learning ⇒ exploration, which will need planning!
• Bandits: MDPs w. i.i.d. state
• Partial monitoring: POMDPs w. i.i.d. state
(w. Tor Lattimore, 2019)
Example: an interacting user
• Show webpage (agent's choice)
• User clicks (observed)
• User happy (unobserved)

Environment: $X$
Action: $A$; Observation: $O = \Phi(A, X)$
Loss: $\mathcal{L}(A, X)$ (unobserved)
The learner is given the maps $\mathcal{L}, \Phi$. For rounds $t = 1, 2, \dots, n$:
1. Environment chooses $X_t \in \mathcal{X}$
2. Learner chooses $A_t \in \mathcal{A}$
3. Learner suffers loss $\mathcal{L}(A_t, X_t)$ – which remains hidden!
4. Learner observes feedback $\Phi(A_t, X_t)$

Regret: $R_n = \max_{a \in \mathcal{A}} \sum_{t=1}^{n} \left( \mathcal{L}(A_t, X_t) - \mathcal{L}(a, X_t) \right)$
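A minimal simulation of this protocol, assuming finite action/outcome sets encoded as a loss matrix `Loss[a, x]` and a feedback matrix `Phi[a, x]`; the i.i.d. outcome distribution and the `policy` callback (which sees past feedback only) are placeholders:

```python
import numpy as np

def run_partial_monitoring(Loss, Phi, policy, n, seed=0):
    """Finite partial monitoring: the learner never observes the losses,
    only Phi[A_t, X_t]. Returns regret against the best fixed action."""
    rng = np.random.default_rng(seed)
    n_actions, n_outcomes = Loss.shape
    feedbacks, total_loss = [], 0.0
    counts = np.zeros(n_outcomes)
    for _ in range(n):
        X = rng.integers(n_outcomes)       # 1. environment chooses X_t (i.i.d. here)
        A = policy(feedbacks)              # 2. learner chooses A_t from feedback alone
        total_loss += Loss[A, X]           # 3. loss is suffered, but stays hidden
        feedbacks.append((A, Phi[A, X]))   # 4. only the feedback is revealed
        counts[X] += 1
    best_fixed = min(Loss[a] @ counts for a in range(n_actions))
    return total_loss - best_fixed
```

In the webpage example above, $X$ would encode the user's type, `Loss` the (unobserved) unhappiness, and `Phi` the (observed) click.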
Partial Monitoring
[Rustichini, 1999]
Why great?
• Informal examples of PM problems:
  – Dynamic pricing
  – Altruistic agents
  – Statistical testing (balancing power and cost)
  – Delayed rewards / surrogates
• Subsumes classic frameworks:
  – finite-armed bandits
  – prediction with expert advice
  – bandits with graph feedback
  – linear bandits
  – dueling bandits
  – …
Partial Monitoring – Classification Theorem

Theorem: Let $\mathcal{A}, \mathcal{X}$ be finite. Let $R_n^*(G)$ be the minimax regret on PM problem $G = (\mathcal{L}, \Phi)$. Then:

$$R_n^*(G) = \begin{cases} 0 & \text{if } G \text{ has no nb actions} \\ \Theta(\sqrt{n}) & \text{if } G \text{ is L.O. and has nb actions} \\ \Theta(n^{2/3}) & \text{if } G \text{ is G.O. but not L.O.} \\ \Omega(n) & \text{otherwise} \end{cases}$$

(nb = neighbouring; L.O. = locally observable; G.O. = globally observable)
[Cesa-Bianchi, Lugosi, Stoltz, 2006; Bartók, Pál, Sz., 2011; Foster and Rakhlin, 2012; Antos, Bartók, Pál and Sz., 2013; Bartók, Foster, Pál, Rakhlin, Sz., 2014; Lattimore and Sz., 2019a].
Algorithms?
• Classical approaches fail in partial monitoring
  – Optimism / Thompson sampling / exponential weights
• Complicated algorithms exist; none are good!
Exploration by Optimisation
Theory
• A single algorithm works in all 'learnable' finite games
• Near-optimal for bandits, full information, graph feedback
• Best known bounds in the general case
• Essentially no tuning; the learning rate is tuned online
Experiments
Conclusions / Future plans
• It is sometimes good to be ambitious!
• More experiments needed
• How to solve the optimization problem? It is convex! But the cost is not $O(k)$…
• What happens when $\mathcal{X}$ is large or infinite?
• Generalizations?
  – Add state/context! Use "exploration by optimization" beyond PM?
• Find more applications?
Part III: RL & generalization
• The world is big
• Need approximate models
• Minimal assumptions to make RL + generalization work?
• Policy error = f(approximation error of the "model")

Three results:
1. Generative model access / planning by solving a reduced-order model
2. Model-based RL: factored linear models – a convenient model class
3. Model-free RL
LRA: Linearly Relaxed ALP

Setup: $c \ge 0$, $\mathbf{1}^{\top} c = 1$; $W_a \ge 0$ entrywise; $\psi > 0$ with $\psi \in \mathrm{span}(\Phi)$;
$$\|J\|_{\infty,\psi} = \max_{s} \frac{|J(s)|}{\psi(s)}, \qquad \beta_\psi := \alpha \max_{a} \|P_a \psi\|_{\infty,\psi} < 1.$$

The LRA linear program:
$$\min_{r \in \mathbb{R}^k} \; c^{\top} \Phi r \quad \text{s.t.} \quad \sum_{a} W_a^{\top} \Phi r \ge \sum_{a} W_a^{\top} \left( g_a + \alpha P_a \Phi r \right).$$

Auxiliary quantities:
$$J^*_{\mathrm{ALP}}(s) = \min\{ r^{\top}\phi(s) : \Phi r \ge J^*, \, r \in \mathbb{R}^k \}, \qquad J^*_{\mathrm{LRA}}(s) = \min\{ r^{\top}\phi(s) : W^{\top} E \Phi r \ge W^{\top} E J^*, \, r \in \mathbb{R}^k \}.$$

Theorem: Let $\epsilon = \inf_{r \in \mathbb{R}^k} \|J^* - \Phi r\|_{\infty,\psi}$ and let $J_{\mathrm{LRA}} = \Phi r_{\mathrm{LRA}}$, where $r_{\mathrm{LRA}}$ is the solution of the above LP. Then, under the said assumptions,
$$\|J^* - J_{\mathrm{LRA}}\|_{1,c} \le \frac{2\, c^{\top}\psi}{1 - \beta_\psi} \left( 3\epsilon + \|J^*_{\mathrm{ALP}} - J^*_{\mathrm{LRA}}\|_{\infty,\psi} \right).$$
[Figure: two panels over queue lengths 0–1000 comparing $J^*$, $J_{CS}$, $J_{CS\text{-}ideal}$, $J_{LRA}$ (value functions) and $u^*$, $u_{CS}$, $u_{CS\text{-}ideal}$, $u_{LRA}$ (policies).]
Fig. 1. Results for a single queue with polynomial features. On both figures the x axis represents the state space: the length of the queue. For more details, see the text.
…[23]), exploring the idea of approximating dual variables and designing algorithms that use the newly derived results to actively compute what constraints to select.
REFERENCES

[1] D. J. White, "A survey of applications of Markov decision processes," The Journal of the Operational Research Society, vol. 44, no. 11, pp. 1073–1096, 1993.
[2] J. Rust, "Numerical dynamic programming in economics," in Handbook of Computational Economics, vol. 1, Elsevier, North Holland, 1996, pp. 619–729.
[3] E. A. Feinberg and A. Shwartz, Handbook of Markov Decision Processes: Methods and Applications. Kluwer Academic Publishers, 2002.
[4] Q. Hu and W. Yue, Markov Decision Processes with Their Applications. Springer, 2007.
[5] O. Sigaud and O. Buffet, Eds., Markov Decision Processes in Artificial Intelligence. Wiley-ISTE, 2010.
[6] N. Bauerle and U. Rieder, Markov Decision Processes with Applications to Finance. Springer, 2011.
[7] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Programming. New York: John Wiley, 1994.
[8] F. L. Lewis and D. Liu, Eds., Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley-IEEE Press, 2012.
[9] M. Abu Alsheikh, D. T. Hoang, D. Niyato, H.-P. Tan, and S. Lin, "Markov decision processes with applications in wireless sensor networks: A survey," IEEE Comm. Surveys & Tutorials, vol. 17, pp. 1239–1267, 2015.
[10] R. J. Boucherie and N. M. van Dijk, Eds., Markov Decision Processes in Practice. Springer, 2017, vol. 248.
[11] J. Rust, "Using randomization to break the curse of dimensionality," Econometrica, vol. 65, pp. 487–516, 1996.
[12] Cs. Szepesvari, "Efficient approximate planning in continuous space Markovian decision problems," AI Communications, vol. 13, no. 3, pp. 163–176, 2001.
[13] M. Kearns, Y. Mansour, and A. Y. Ng, "A sparse sampling algorithm for near-optimal planning in large Markov decision processes," Machine Learning, vol. 49, pp. 193–208, 2002.
[14] V. D. Blondel and J. N. Tsitsiklis, "A survey of computational complexity results in systems and control," Automatica, vol. 36, pp. 1249–1274, 2000.
[15] P. J. Schweitzer and A. Seidmann, "Generalized polynomial approximations in Markovian decision processes," Journal of Mathematical Analysis and Applications, vol. 110, pp. 568–582, 1985.
[16] L. Kallenberg. (2017). Markov decision processes: Lecture notes, [Online]. Available: https://goo.gl/yhvrph.
[17] D. Schuurmans and R. Patrascu, "Direct value-approximation for factored MDPs," in NIPS, 2001, pp. 1579–1586.
[18] C. Guestrin, D. Koller, R. Parr, and S. Venkataraman, "Efficient solution algorithms for factored MDPs," Journal of Artificial Intelligence Research, vol. 19, pp. 399–468, 2003.
[19] D. P. de Farias and B. Van Roy, "The linear programming approach to approximate dynamic programming," Operations Research, vol. 51, pp. 850–865, 2003.
[20] D. P. de Farias and B. Van Roy, "On constraint sampling in the linear programming approach to approximate dynamic programming," Mathematics of Operations Research, vol. 29, pp. 462–478, 2004.
[21] B. Kveton and M. Hauskrecht, "Heuristic refinements of approximate linear programming for factored continuous-state Markov decision processes," in ICAPS, 2004, pp. 306–314.
[22] M. Petrik and S. Zilberstein, "Constraint relaxation in approximate linear programs," in ICML, 2009, pp. 809–816.
[23] V. V. Desai, V. F. Farias, and C. C. Moallemi, "A smoothed approximate linear program," in NIPS, 2009, pp. 459–467.
[24] G. Taylor, M. Petrik, R. Parr, and S. Zilberstein, "Feature selection using regularization in approximate linear programs for Markov decision processes," in ICML, 2010, pp. 871–878.
[25] J. Pazis and R. Parr, "Non-parametric approximate linear programming for MDPs," in AAAI, 2011.
[26] N. Bhat, V. Farias, and C. C. Moallemi, "Non-parametric approximate dynamic programming via the kernel method," in NIPS, 2012, pp. 386–394.
[27] Y. Abbasi-Yadkori, P. Bartlett, and A. Malek, "Linear programming for large-scale Markov decision problems," in ICML, 2014, pp. 496–504.
[28] C. Lakshminarayanan and S. Bhatnagar, "A generalized reduced linear program for Markov Decision Processes," in AAAI, 2015, pp. 2722–2728.
[29] R.-R. Chen and S. P. Meyn, "Value iteration and optimization of multiclass queueing networks," Queueing Systems, vol. 32, no. 1–3, pp. 65–97, 1999.
(IEEE TAC, 2018; w. Chandrashekar Lakshminarayanan & Shalabh Bhatnagar)
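Since the LRA program above is a plain linear program in $r$, it can be handed to a generic LP solver. Here is a minimal sketch with SciPy; the tensor shapes below are assumptions, and no attempt is made to pick $W$ or $c$ well:

```python
import numpy as np
from scipy.optimize import linprog

def solve_lra_alp(Phi, g, P, W, c, alpha):
    """min_r c^T Phi r  s.t.  sum_a W_a^T Phi r >= sum_a W_a^T (g_a + alpha P_a Phi r).
    Assumed shapes: Phi (S,k), g (A,S), P (A,S,S), W (A,S,m) nonnegative,
    c (S,) state-relevance weights, discount alpha in (0,1)."""
    acts = range(P.shape[0])
    A_ge = sum(W[a].T @ (Phi - alpha * P[a] @ Phi) for a in acts)
    b_ge = sum(W[a].T @ g[a] for a in acts)
    # linprog minimizes c^T x subject to A_ub x <= b_ub; flip the inequality.
    res = linprog(c=Phi.T @ c, A_ub=-A_ge, b_ub=-b_ge,
                  bounds=[(None, None)] * Phi.shape[1])
    assert res.success, res.message
    return res.x                     # r_LRA; the value estimate is Phi @ res.x
```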
Model-based RL
• Good? Bad?
• Bonus question: Can $\|V^* - V^{\hat{\pi}}\|$ be controlled via controlling $\|\hat{V} - V^*\|$?
• Can we do better? Perhaps using extra structure?
(w. Bernardo Ávila Pires, COLT 2016)
Structure: factored linear models
(w. Bernardo Ávila Pires; COLT 2016 / semi / linear / PhD thesis)

$$\mathcal{P}(dx' \mid x, a) \approx \xi(dx')^{\top} \psi(x, a)$$
$$\mathcal{R}V = \int V(x')\, \xi(dx') \;(= w) \in \mathbb{R}^{d}$$
$$(\mathcal{Q}w)(x, a) = w^{\top} \psi(x, a)$$

$\mathcal{R}: \mathrm{VFUN} \to \mathbb{R}^{d}$; $\mathcal{Q}: \mathbb{R}^{d} \to \mathrm{AVFUN}$; $\mathcal{P}: \mathrm{VFUN} \to \mathrm{AVFUN}$, $(\mathcal{P}V)(x, a) = \int V(x')\, \mathcal{P}(dx' \mid x, a)$

$\mathcal{P} \approx \mathcal{Q}\mathcal{R}$. Special cases:
• Tabular
• Linear MDP
• KME
• Stoch. Fact.
• KBRL
• …

Legend: $\mathcal{V} = \mathrm{VFUN}$, $\mathcal{W} = \mathbb{R}^{\mathcal{I}} = \mathrm{CVFUN}$, $\mathcal{V}^{\mathcal{A}} = \mathrm{AVFUN}$
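One way to read the diagram in code: a sketch of a factored linear model where $\xi$ is supported on $d$ anchor next-states, so $\mathcal{R}$ is evaluation at the anchors and $\mathcal{Q}$ is a linear read-out; the class and names are illustrative, not from the paper:

```python
import numpy as np

class FactoredLinearModel:
    """P(dx'|x,a) ≈ xi(dx')^T psi(x,a), with xi a vector of point masses
    at d anchor next-states. Then P V ≈ Q(R V): a rank-d Bellman backup."""

    def __init__(self, anchors, psi):
        self.anchors = anchors   # the d anchor next-states (assumption)
        self.psi = psi           # feature map psi(x, a) -> shape-(d,) array

    def R(self, V):
        # R: VFUN -> R^d; integrating V against a point mass at anchor i
        # is just evaluating V there.
        return np.array([V(xp) for xp in self.anchors])

    def Q(self, w):
        # Q: R^d -> AVFUN, (Q w)(x, a) = w^T psi(x, a)
        return lambda x, a: float(w @ self.psi(x, a))

    def PV(self, V):
        # (P V)(x, a) ≈ (Q R V)(x, a)
        return self.Q(self.R(V))
```

Tabular models, linear MDPs, kernel mean embeddings (KME), and KBRL correspond to particular choices of $\xi$ and $\psi$.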
Policy error in factored linear models
$$U^* = M T \mathcal{Q} u^*, \qquad u^* = M' T \mathcal{R}^{\mathcal{A}} \mathcal{Q} u^*, \qquad \hat{\pi} = G T \mathcal{Q} u^*$$
(w. Bernardo Ávila Pires; COLT 2016 / semi / linear / PhD thesis)

Questions:
• Is the bound tight?
• Time/action abstraction?
• Efficient learning? What specific models to use?
Online, model-free RL w. neural nets
• Continuing RL; $\bar{R}_T$: pseudo-regret; let $Q_t := Q^{\pi_t}$.
• Key identity:
$$\bar{R}_T = \sum_{x} \nu_{\pi^*}(x) \sum_{t=1}^{T} \Big( \langle Q_t(x, \cdot), \pi_t(\cdot \mid x) \rangle - \langle Q_t(x, \cdot), \pi^*(\cdot \mid x) \rangle \Big)$$
where $\nu_{\pi^*}$ is the stationary state distribution of $\pi^*$.
• Then…
$$\begin{aligned}
\langle Q_t(x,\cdot), \pi_t(\cdot \mid x) \rangle - \langle Q_t(x,\cdot), \pi^*(\cdot \mid x) \rangle
&= \langle \hat{Q}_t(x,\cdot), \pi_t(\cdot \mid x) \rangle - \langle \hat{Q}_t(x,\cdot), \pi^*(\cdot \mid x) \rangle && \Rightarrow \text{control w. OLP} \\
&\;+\; \langle Q_t(x,\cdot), \pi_t(\cdot \mid x) \rangle - \langle \hat{Q}_t(x,\cdot), \pi_t(\cdot \mid x) \rangle && \Rightarrow \text{A: } L_1(\nu_{\pi^*} \otimes \pi_t) \\
&\;+\; \langle \hat{Q}_t(x,\cdot), \pi^*(\cdot \mid x) \rangle - \langle Q_t(x,\cdot), \pi^*(\cdot \mid x) \rangle && \Rightarrow \text{A: } L_1(\nu_{\pi^*} \otimes \pi^*)
\end{aligned}$$
(w. N. Lazic, Y. Abbasi-Yadkori, G. Weisz, ICML'19 + arXiv)
Politex
w. N. Lazic, Y. Abbasi-Yadkori, G. Weisz, ICML’19 + arXiv
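The Politex slide itself is not reproduced in the transcript; as a stand-in, here is a tabular sketch of the update from the ICML'19 paper: the next policy is exponential weights applied to the sum of past action-value estimates. `estimate_Q` is a placeholder for the black-box approximator assumed in the theorem below.

```python
import numpy as np

def politex_policy(Q_sum, eta):
    """pi_{t+1}(a|s) proportional to exp(eta * sum_{i<=t} Qhat_i(s, a));
    Q_sum has shape (S, A)."""
    logits = eta * Q_sum
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

def politex(estimate_Q, n_states, n_actions, n_phases, eta):
    """Outer loop sketch: run the current policy for a phase, get Qhat
    from the black-box evaluator, accumulate, re-derive the policy."""
    Q_sum = np.zeros((n_states, n_actions))
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform start
    for _ in range(n_phases):
        Q_sum += estimate_Q(pi)      # Qhat_t ≈ Q^{pi_t} from a phase of data
        pi = politex_policy(Q_sum, eta)
    return pi
```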
Regret bounds
w. N. Lazic, Y. Abbasi-Yadkori, G. Weisz, ICML’19 + arXiv
Theorem
Assume that for any policy $\pi$, after following $\pi$ for $n$ steps, a black-box function approximator produces an action-value function whose error is $\epsilon_0 + 1/\sqrt{n}$, up to some universal constant.

Then the average pseudo-regret of Politex after $T$ steps is $\epsilon_0 + T^{-1/4}$.
Refinements
• Problem: How to get the $\epsilon_0 + 1/\sqrt{n}$ error?
  – E.g., linear VFA? LSPE! $\epsilon_0$: the limiting error of LSPE could be ≫ the best error.
• Refinement 1:
  – Use an on-policy state value-function approximator
  – Add extra action-dithering per state
  – Assume all policies excite the state features
• Refinement 2:
  – Assume access to an "exploration policy" that excites the features
  – Interleave exploration steps with policy steps
  – Use off-policy (!) VFA (which one?)
  – ⇒ Regret degrades a bit
• Questions:
  – Can we do better with other OL methods? Is averaging really necessary?
  – Better value-function learners?
w. N. Lazic, Y. Abbasi-Yadkori, G. Weisz, ICML’19 + arXiv
Summary