Structural Return Maximization for Reinforcement Learning
Josh Joseph, Alborz Geramifard, Javier Velez, Jonathan How, Nicholas Roy
How should we act in the presence of complex, unknown dynamics?
What do I mean by complex dynamics?
• Can’t derive the dynamics from first principles / intuition
• Any dynamics model will be approximate
• Limited data
– Otherwise just do nearest neighbors
• Batch data
– Trying to keep it as simple as possible for now
– Fairly straightforward to extend to active learning
How does RL solve these problems?
• Assume some representation class for:
– Dynamics model
– Value function
– Policy
• Collect some data
• Find the “best” representation based on the data
How does RL solve these problems?
• The “best” representation based on the data
• This defines the best policy…not the best representation

Objective annotated on the slide: the value (return) of a policy π from a starting state s₀, under the reward r and the unknown dynamics model T:
V(π) = E[ Σₜ r(sₜ, aₜ) | s₀, π, T ]
…but does RL actually solve this problem?
• Policy Search
– Policy directly parameterized by θ
– Maximizes an empirical estimate of the return over the episodes:
V̂(θ) = (1/K) Σₖ Rₖ, where K is the number of episodes and Rₖ the return of episode k
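The empirical return estimate above can be sketched in a few lines. A minimal illustration, assuming a made-up deterministic 1-D environment (`env_step`) and two hand-written policies; none of these names come from the talk:

```python
def rollout_return(policy, env_step, s0, horizon, gamma=1.0):
    # accumulate (discounted) reward along one episode
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r = env_step(s, a)
        total += discount * r
        discount *= gamma
    return total

def empirical_return(policy, env_step, s0, num_episodes, horizon):
    # V_hat = (1/K) * sum over K episodes of the episode return
    return sum(rollout_return(policy, env_step, s0, horizon)
               for _ in range(num_episodes)) / num_episodes

def env_step(s, a):
    # toy deterministic dynamics: s' = s + a, reward = -|s'|
    s2 = s + a
    return s2, -abs(s2)

def toward_zero(s):
    return -1 if s > 0 else (1 if s < 0 else 0)

def always_right(s):
    return 1

v_good = empirical_return(toward_zero, env_step, s0=3, num_episodes=5, horizon=5)
v_bad = empirical_return(always_right, env_step, s0=3, num_episodes=5, horizon=5)
# v_good (-3.0) beats v_bad (-30.0)
```

With dynamics this simple the estimate is exact after a single episode; in practice rollouts are noisy and K controls the variance of V̂.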
…but does RL actually solve this problem?
• Model-based RL
– Fits the dynamics model to the data by maximum likelihood: m̂ = argmaxₘ p(data | m)
– Maximizing likelihood ≠ maximizing return
– …similar story for value-based methods
ML model selection in RL
• So why do we do it?
– It’s easy
– It sometimes works really well
– Intuitively it feels like finding the most likely model should result in a high-performing policy
• Why does it fail?
– Chooses an “average” model based on the data
– Ignores the reward function
• What do we do then?
Our Approach
• Model-based RL
– Choose the dynamics model whose policy maximizes an empirical estimate of the return, rather than the likelihood of the data
– Prior work: “Planning with Misspecified Model Classes” (us)
• We can do the same thing in a value-based setting.
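The idea of choosing among candidate models by empirical return can be sketched on a toy problem. This is a hypothetical illustration, not the paper's algorithm: candidate models are drift values for a 1-D system, and every name below is invented for the sketch.

```python
TRUE_DRIFT = 0.6

def true_step(s, a):
    # the real (unknown-to-the-agent) dynamics and reward
    s2 = s + a + TRUE_DRIFT
    return s2, -abs(s2)

def policy_for(model_drift):
    # the policy that is optimal if the candidate model were correct:
    # pick the action that (under the model) lands exactly on s' = 0
    def pi(s):
        return -model_drift - s
    return pi

def empirical_return(pi, s0=2.0, horizon=5):
    s, total = s0, 0.0
    for _ in range(horizon):
        s, r = true_step(s, pi(s))
        total += r
    return total

candidates = [0.0, 0.5, 1.0]  # a misspecified model class: no candidate is exact
best = max(candidates, key=lambda m: empirical_return(policy_for(m)))
# best is 0.5: the candidate whose induced policy earns the most return,
# even though maximum likelihood over some datasets could prefer another
```

The point mirrors the slide: the selected model need not be the most likely one, only the one whose policy performs best.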
…but
• We are indirectly choosing a policy representation
• The win of this indirect representation is that it can be “small”
• Small = less data?
– Intuitively you’d think so
– Empirical evidence from toy problems
• But all of our guarantees rely on infinite data
• …maybe there’s a way to be more concrete
What we want
• How does the representation space relate to true return?
• …they’ve been doing this in classification since the 60s
– Relationship between the “size” of the representation space and the amount of data
≈?
How to get there
Model-based, value-based, policy search
Map RL to classification → Empirical Risk Minimization
Measuring function class size → Bound on true risk
Structure of function classes → Structural risk minimization
Classification
• A classifier f labels points, e.g. a linear classifier on inputs x = [x₁, x₂]:
f([x₁, x₂]) = sign([θ₁, θ₂]ᵀ [x₁, x₂])
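The linear classifier above is a one-liner in code. A minimal sketch with made-up parameter values (the θ below is arbitrary, chosen only for illustration):

```python
theta = (2.0, -1.0)  # hypothetical weights [theta_1, theta_2]

def f(x):
    # f(x) = sign(theta^T x); define sign(0) = +1 for concreteness
    s = theta[0] * x[0] + theta[1] * x[1]
    return 1 if s >= 0 else -1

f((1.0, 1.0))  # 2 - 1 = 1  -> +1
f((0.0, 2.0))  # -2         -> -1
```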
Classification
• Loss (cost) ℓ(f(x), y)
• Risk: expected loss under the unknown data distribution P
R(f) = E₍ₓ,ᵧ₎∼P [ ℓ(f(x), y) ]
Empirical Risk Minimization
• The data distribution is unknown, so we minimize the empirical estimate of the risk from N samples:
R̂(f) = (1/N) Σᵢ ℓ(f(xᵢ), yᵢ)
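Empirical risk minimization over a finite class is a direct `min`. A small sketch with 0–1 loss; the data and the threshold hypotheses are made up for illustration:

```python
# (x, y) samples, invented for the example
data = [(-2.0, -1), (-1.0, -1), (0.5, 1), (1.5, 1), (2.0, 1)]

def make_threshold(t):
    # classifier: +1 to the right of t, -1 to the left
    return lambda x: 1 if x > t else -1

hypotheses = [make_threshold(t) for t in (-3.0, -1.5, 0.0, 1.0, 3.0)]

def empirical_risk(f, samples):
    # R_hat(f) = (1/N) * number of misclassified samples (0-1 loss)
    return sum(1 for x, y in samples if f(x) != y) / len(samples)

best = min(hypotheses, key=lambda f: empirical_risk(f, data))
# the threshold at 0.0 separates the samples, so the minimum risk is 0.0
```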
Mapping RL to Classification
Measuring the size of a function class: VC Dimension
• Introduces a notion of “shattering”
– I pick the inputs
– You pick the labels
– VC Dim = max number of points I can perfectly decide
• e.g., VCDim(linear classifiers in the plane) = 3
• Magically, shattering (VC Dim) can be used to bound true risk
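Shattering can be checked by brute force on small examples. A sketch assuming halfplane classifiers sign(a·x + b·y + c) and a coarse parameter grid (both invented for the demo): three points in general position are shattered, while no labeling-complete witness exists for four points in the XOR configuration.

```python
from itertools import product

def sign(z):
    return 1 if z > 0 else -1

def can_shatter(points, params):
    # brute force: for EVERY labeling of the points, is there some
    # halfplane sign(a*x + b*y + c) in our parameter grid realizing it?
    # (a failure only shows the grid can't do it; a real proof of
    # non-shatterability needs a geometric argument)
    for labels in product([-1, 1], repeat=len(points)):
        if not any(all(sign(a * x + b * y + c) == l
                       for (x, y), l in zip(points, labels))
                   for a, b, c in params):
            return False
    return True

grid = list(product([-2, -1, 0, 1, 2], repeat=3))
can_shatter([(0, 0), (1, 0), (0, 1)], grid)           # True: all 8 labelings realized
can_shatter([(0, 0), (1, 1), (1, 0), (0, 1)], grid)   # False: the XOR labeling fails
```

For the four-point case the labeling (+, +, -, -) on (0,0), (1,1) vs (1,0), (0,1) is impossible for any halfplane, which is the classic argument that VCDim of planar halfplanes is exactly 3.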
For those of you familiar with statistical learning theory…
• VC Dim
– Only known for a few function classes
– Difficult to estimate, bound
• Rademacher complexity
– Use the data to estimate the “volume” of the function class
– This volume can then be used in a similar bound
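Empirical Rademacher complexity can be estimated directly from data by Monte Carlo. A sketch under invented data and function classes: draw random ±1 signs σ, take the supremum over the class of the signed correlation, and average. A richer class tracks random signs better, so it scores higher on the same data.

```python
import random

def empirical_rademacher(functions, xs, num_draws=2000, seed=0):
    # R_hat = E_sigma [ sup_f (1/N) * sum_i sigma_i * f(x_i) ]
    rng = random.Random(seed)
    n, total = len(xs), 0.0
    for _ in range(num_draws):
        sigma = [rng.choice([-1, 1]) for _ in range(n)]
        total += max(sum(s * f(x) for s, x in zip(sigma, xs)) / n
                     for f in functions)
    return total / num_draws

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # made-up sample

small_class = [lambda x: 1, lambda x: -1]  # just the two constant classifiers

def make_threshold(t):
    return lambda x: 1 if x > t else -1

thresholds = [make_threshold(t) for t in (-3, -1.5, -0.5, 0.5, 1.5, 3)]

# the threshold class contains the constants and more, so its
# empirical Rademacher complexity on the same xs is strictly larger
```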
Measuring the size of a function class
• Now we can say concrete things about why we may prefer one representation over another with limited data
Empirical Risk Minimization and Limited Data
• If we have limited data (so the bound is large), we cannot expect small empirical risk to result in small true risk
• …so what do we do?
• Choose the function class which minimizes the bound!
Structural Risk Minimization
• Using a “structure” of function classes F₁ ⊆ F₂ ⊆ …
• For N data, we choose the function class (and function within it) that minimizes the risk bound: empirical risk plus a complexity penalty that grows with the size of the class and shrinks with N
• Many natural structures of policy classes!
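The selection rule above can be sketched with nested threshold classes. This is a toy stand-in: the √(h/N) penalty imitates the shape of a VC-style bound but is not the bound from the talk, and the data and classes are invented.

```python
import math

def empirical_risk(f, samples):
    # 0-1 loss averaged over the samples
    return sum(1 for x, y in samples if f(x) != y) / len(samples)

def srm_select(structure, samples, penalty):
    # over nested classes F_1 ⊆ F_2 ⊆ ..., minimize
    # empirical risk of the ERM solution + complexity penalty
    best, best_bound = None, float("inf")
    n = len(samples)
    for size_measure, functions in structure:
        f = min(functions, key=lambda g: empirical_risk(g, samples))
        bound = empirical_risk(f, samples) + penalty(size_measure, n)
        if bound < best_bound:
            best, best_bound = f, bound
    return best, best_bound

penalty = lambda h, n: math.sqrt(h / n)  # toy stand-in for a capacity term

constants = [lambda x: 1, lambda x: -1]

def make_threshold(t):
    return lambda x: 1 if x > t else -1

threshold_class = constants + [make_threshold(t) for t in (-0.5, 0.0, 0.5)]
structure = [(1, constants), (2, threshold_class)]  # nested: F1 ⊆ F2

data = [(-1.0, -1), (1.0, 1)]
chosen, bound = srm_select(structure, data, penalty)
# the richer class wins here: its zero empirical risk outweighs its larger penalty
```

With less data or noisier labels the penalty term dominates and the rule falls back to the smaller class, which is exactly the trade-off the slide is after.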
Is this Bayesian?
• Prior knowledge
– Structure encodes prior knowledge
• Robust to over-fitting
– Choose the function class based on the risk bound
• No Bayes update
• No assumption that the true function is somewhere in the structure
– Breaks most (all?) Bayesian nonparametrics
Contribution
• Classification to RL mapping
• Transferred probabilistic bounds from statistical learning theory to RL
• Applied structural risk minimization to RL
Backup Slides
From last time…
{𝒎𝒄, 𝒎𝒑, 𝒍}
≈?
Measuring the size of a function class
• Rademacher complexity