Transcript
Page 1: Structural Return Maximization for Reinforcement Learning

Josh Joseph, Alborz Geramifard, Javier Velez, Jonathan How, Nicholas Roy

Pages 2–5: How should we act in the presence of complex, unknown dynamics?

Page 6: What do I mean by complex dynamics?

• Can't be derived from first principles / intuition
• Any dynamics model will be approximate
• Limited data
  – Otherwise, just do nearest neighbors
• Batch data
  – Trying to keep it as simple as possible for now
  – Fairly straightforward to extend to active learning


Page 8: How does RL solve these problems?

• Assume some representation class for:
  – Dynamics model
  – Value function
  – Policy
• Collect some data
• Find the "best" representation based on the data


Page 10: How does RL solve these problems?

• The "best" representation based on the data
• This defines the best policy… not the best representation

[Equation callouts: value (return), policy, starting state, reward, unknown dynamics model]


Pages 13–16: …but does RL actually solve this problem?

• Policy Search
  – Policy directly parameterized by a parameter vector

[Equation callouts: number of episodes, empirical estimate]
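
The callouts label an empirical return estimate; a hedged reconstruction (symbols assumed: θ the policy parameters, n the number of episodes, R⁽ⁱ⁾ the return of episode i):

$$\hat{V}(\theta) = \frac{1}{n}\sum_{i=1}^{n} R^{(i)}, \qquad \theta^{*} = \arg\max_{\theta} \hat{V}(\theta)$$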

Pages 17–21: …but does RL actually solve this problem?

• Model-based RL
  – Dynamics model fit to the data (by maximum likelihood)

Maximizing likelihood != maximizing return

…similar story for value-based methods
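
A hedged sketch of the mismatch the slide names (symbols assumed: 𝒯 the model class, D the batch data, π*_T the optimal policy under model T):

$$\hat{T}_{\mathrm{ML}} = \arg\max_{T \in \mathcal{T}} \sum_{(s,a,s') \in D} \log p(s' \mid s, a; T), \quad\text{and in general}\quad V\big(\pi^{*}_{\hat{T}_{\mathrm{ML}}}\big) \neq \max_{T \in \mathcal{T}} V\big(\pi^{*}_{T}\big)$$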

Page 22: ML model selection in RL

• So why do we do it?
  – It's easy
  – It sometimes works really well
  – Intuitively, it feels like finding the most likely model should result in a high-performing policy
• Why does it fail?
  – Chooses an "average" model based on the data
  – Ignores the reward function
• What do we do then?


Pages 25–27: Our Approach

• Model-based RL
  – Dynamics model chosen by maximizing an empirical estimate of the return
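
A hedged reading of the selection rule the callout suggests (assumed: π*_T is the policy obtained by planning in model T, and R⁽ⁱ⁾_π the return of episode i under π):

$$\hat{T} = \arg\max_{T \in \mathcal{T}} \hat{V}\big(\pi^{*}_{T}\big), \qquad \hat{V}(\pi) = \frac{1}{n}\sum_{i=1}^{n} R^{(i)}_{\pi}$$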

Page 28: Planning with Misspecified Model Classes

[Figure label: "Us"]

Pages 29–30: Our Approach

• Model-based RL
  – Dynamics model chosen by maximizing an empirical estimate of the return

We can do the same thing in a value-based setting.

Page 31: …but

• We are indirectly choosing a policy representation
• The win of this indirect representation is that it can be "small"
• Small = less data?
  – Intuitively you'd think so
  – Empirical evidence from toy problems
• But all of our guarantees rely on infinite data
• …maybe there's a way to be more concrete


Pages 33–35: What we want

• How does the representation space relate to true return?
• …they've been doing this in classification since the '60s
  – Relationship between the "size" of the representation space and the amount of data

≈?

Pages 36–41: How to get there

• Model-based, value-based, policy search
• Map RL to classification → Empirical Risk Minimization
• Measuring function class size → Bound on true risk
• Structure of function classes → Structural risk minimization

Pages 42–44: Classification

A linear classifier f over inputs x₁, x₂:

$$f\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \operatorname{sign}\left(\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}^{T}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right)$$
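
A minimal runnable sketch of the slide's classifier (the weights θ here are illustrative, not from the deck):

import numpy as np

def predict(theta, x):
    # The slide's linear classifier: sign(theta^T x)
    return np.sign(theta @ x)

theta = np.array([1.0, -2.0])   # illustrative parameters
x = np.array([0.5, 0.1])
print(predict(theta, x))        # 1.0: this side of the decision boundary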

Pages 45–46: Classification

[Equation callouts: risk, loss (cost), unknown data distribution]
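
The callouts label the standard risk functional from statistical learning theory; a reconstruction for reference (L the loss, P the unknown data distribution):

$$R(f) = \mathbb{E}_{(x,y)\sim P}\big[L(f(x), y)\big] = \int L\big(f(x), y\big)\, dP(x, y)$$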

Pages 47–48: Empirical Risk Minimization

[Equation callouts: unknown data distribution, number of samples, empirical estimate]
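
A reconstruction of the empirical counterpart the callouts label (N samples (xᵢ, yᵢ) drawn from the unknown P):

$$R_{\mathrm{emp}}(f) = \frac{1}{N}\sum_{i=1}^{N} L\big(f(x_i), y_i\big), \qquad f_{\mathrm{ERM}} = \arg\min_{f \in \mathcal{F}} R_{\mathrm{emp}}(f)$$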

Pages 49–51: Mapping RL to Classification
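
These mapping slides were figure-only; a hedged summary of the correspondence the roadmap implies: policy ↔ classifier, episode ↔ data sample, negative return of an episode ↔ loss, negative expected return ↔ true risk, and the average return over n episodes ↔ empirical risk.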

Page 52: How to get there

• Model-based, value-based, policy search
• Map RL to classification → Empirical Risk Minimization
• Measuring function class size → Bound on true risk
• Structure of function classes → Structural risk minimization

Pages 53–62: Measuring the size of a function class: VC Dimension

• Introduces a notion of "shattering"
  – I pick the inputs
  – You pick the labels
  – VC Dim = max number of points I can perfectly decide

VCDim(·) = 3 (for the slide's example class)

• Magically, shattering (VC Dim) can be used to bound true risk
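
A toy, brute-force illustration of shattering (assumed setup: three non-collinear points, affine classifiers in 2-D, random search over parameters):

import numpy as np
from itertools import product

# Three non-collinear points; affine classifiers sign(w.x + b) in 2-D.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def realizable(labels, trials=20000, rng=np.random.default_rng(0)):
    # Random search for parameters (w, b) that reproduce `labels` exactly.
    for _ in range(trials):
        w, b = rng.normal(size=2), rng.normal()
        if np.all(np.sign(pts @ w + b) == labels):
            return True
    return False

# Every one of the 2^3 labelings is realizable: the points are shattered,
# so the VC dimension of this class is at least 3.
print(all(realizable(np.array(l)) for l in product([-1, 1], repeat=3)))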

Page 63: For those of you familiar with statistical learning theory…

• VC Dim
  – Only known for a few function classes
  – Difficult to estimate or bound
• Rademacher complexity
  – Uses the data to estimate the "volume" of the function class
  – This volume can then be used in a similar bound
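
For reference (not on the slide), the standard empirical Rademacher complexity, with σᵢ independent uniform ±1 "coin flips":

$$\hat{\mathfrak{R}}_{N}(\mathcal{F}) = \mathbb{E}_{\sigma}\left[\sup_{f \in \mathcal{F}} \frac{1}{N}\sum_{i=1}^{N} \sigma_{i} f(x_{i})\right]$$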

Pages 64–65: Measuring the size of a function class

• Now we can say concrete things about why we may prefer one representation over another with limited data

Page 66: How to get there

• Model-based, value-based, policy search
• Map RL to classification → Empirical Risk Minimization
• Measuring function class size → Bound on true risk
• Structure of function classes → Structural risk minimization


Pages 68–70: Empirical Risk Minimization and Limited Data

• If we have limited data (i.e., the bound is large), we cannot expect small empirical risk to result in small true risk
• …so what do we do?
• Choose the function class which minimizes the bound!
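
One standard form of the bound the slides invoke (Vapnik): with probability at least 1 − η, for every f in a class of VC dimension h,

$$R(f) \;\le\; R_{\mathrm{emp}}(f) + \sqrt{\frac{h\left(\ln\frac{2N}{h} + 1\right) - \ln\frac{\eta}{4}}{N}}$$

The penalty grows with h and shrinks with N, which is exactly the size-of-class versus amount-of-data trade-off above.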

Pages 71–73: Structural Risk Minimization

• Using a "structure" of function classes
• For N data points, we choose the function class that minimizes the bound

Many natural structures of policy classes!
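
A minimal sketch of the selection rule, assuming a nested structure F₁ ⊂ F₂ ⊂ … where each class exposes hypothetical fit / emp_risk / vc_dim helpers:

import math

def srm_select(classes, data, n, eta=0.05):
    # Pick the class whose ERM solution minimizes empirical risk + VC penalty.
    best_f, best_bound = None, float("inf")
    for c in classes:                       # hypothetical class objects
        f = c.fit(data)                     # empirical risk minimizer within c
        h = c.vc_dim                        # capacity of class c
        penalty = math.sqrt((h * (math.log(2 * n / h) + 1)
                             - math.log(eta / 4)) / n)
        bound = c.emp_risk(f, data) + penalty
        if bound < best_bound:
            best_f, best_bound = f, bound
    return best_f, best_bound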

Pages 74–76: Is this Bayesian?

• Prior knowledge
  – The structure encodes prior knowledge
• Robust to over-fitting
  – Choose the function class based on the risk bound
• No Bayes update
• No assumption that the true function is somewhere in the structure
  – Breaks most (all?) Bayesian nonparametrics

Pages 77–78: Contribution

• Classification-to-RL mapping
• Transferred probabilistic bounds from statistical learning theory to RL
• Applied structural risk minimization to RL

Page 79: Backup Slides

Pages 80–82: From last time…

≈?

$\{m_c, m_p, l\}$

Page 83: Measuring the size of a function class

• Rademacher complexity

