Transcript
Page 1: Dueling DQN & Prioritised Experience Replay

Dueling DQN & Prioritised Experience Replay

Alina Vereshchaka

CSE4/510 Reinforcement Learning, Fall 2019

avereshc@buffalo.edu

October 10, 2019

*Slides are based on the papers: Wang, Ziyu, et al. "Dueling network architectures for deep reinforcement learning." (2015)

Schaul, Tom, et al. "Prioritized experience replay." (2015)

Page 2: Dueling DQN & Prioritised Experience Replay

Overview

1 Recap: DQN

2 Recap: Double DQN

3 Dueling DQN

4 Prioritized Experience Replay (PER)

Page 3: Dueling DQN & Prioritised Experience Replay

Table of Contents

1 Recap: DQN

2 Recap: Double DQN

3 Dueling DQN

4 Prioritized Experience Replay (PER)

Page 4: Dueling DQN & Prioritised Experience Replay

Recap: Deep Q-Networks (DQN)

Represent value function by deep Q-network with weights w

Q(s, a, w) ≈ Q^π(s, a)

Define the objective function (mean-squared TD error)

Leading to the following Q-learning gradient

Optimize the objective end-to-end by SGD, using ∂L(w)/∂w
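The objective and gradient on this slide are rendered as images in the original deck. As a hedged illustration only, the sketch below implements the standard DQN mean-squared TD loss with a frozen target network in PyTorch; the network and batch names are placeholders, not from the slides.

```python
import torch
import torch.nn as nn

# Sketch of the DQN objective L(w) = E[(r + γ max_a' Q(s', a'; w⁻) − Q(s, a; w))²];
# autograd then supplies ∂L(w)/∂w for the SGD update.
def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch                            # tensors from a replay buffer (a: int64, done: float mask)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a; w)
    with torch.no_grad():                                    # target network weights w⁻ are held fixed
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```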

Page 5: Dueling DQN & Prioritised Experience Replay

Deep Q-Network (DQN) Architecture

[Figure: network diagrams of a naive DQN vs. the optimized DQN used by DeepMind]

Page 6: Dueling DQN & Prioritised Experience Replay

DQN Algorithm

Page 7: Dueling DQN & Prioritised Experience Replay

Table of Contents

1 Recap: DQN

2 Recap: Double DQN

3 Dueling DQN

4 Prioritized Experience Replay (PER)

Page 8: Dueling DQN & Prioritised Experience Replay

Double Q-learning

Two estimators:

Estimator Q1: Obtain best actions

Estimator Q2: Evaluate Q for the above action

Q1(s, a) ← Q1(s, a) + α (Target − Q1(s, a))

Q Target: r(s, a) + γ max_{a′} Q1(s′, a′)

Double Q Target: r(s, a) + γ Q2(s′, argmax_{a′} Q1(s′, a′))
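As a hedged sketch of the two targets above, written in the Double DQN form where the online network plays the role of Q1 and the target network plays Q2 (tensor and network names are assumptions):

```python
import torch

def q_target(r, s_next, done, target_net, gamma=0.99):
    # Q target: r + γ max_a' Q1(s', a'); the same network both selects and evaluates
    return r + gamma * (1 - done) * target_net(s_next).max(dim=1).values

def double_q_target(r, s_next, done, online_net, target_net, gamma=0.99):
    # Double Q target: r + γ Q2(s', argmax_a' Q1(s', a')); Q1 selects, Q2 evaluates
    a_star = online_net(s_next).argmax(dim=1, keepdim=True)
    return r + gamma * (1 - done) * target_net(s_next).gather(1, a_star).squeeze(1)
```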

Page 9: Dueling DQN & Prioritised Experience Replay

Double Q-learning

Page 10: Dueling DQN & Prioritised Experience Replay

Double Deep Q Network

Two estimators:

Estimator Q1: Obtain best actions

Estimator Q2: Evaluate Q for the above action

Page 11: Dueling DQN & Prioritised Experience Replay

Double Deep Q Network

Page 12: Dueling DQN & Prioritised Experience Replay

Table of Contents

1 Recap: DQN

2 Recap: Double DQN

3 Dueling DQN

4 Prioritized Experience Replay (PER)

Page 13: Dueling DQN & Prioritised Experience Replay

Dueling DQN

What does the Q-value tell us?

How good it is to be in state s and take action a in that state: Q(s, a).

Page 15: Dueling DQN & Prioritised Experience Replay

Advantage Function A(s, a)

A(s, a) = Q(s, a) − V(s)

If A(s, a) > 0: our gradient is pushed in that direction

If A(s, a) < 0 (our action does worse than the average value of that state): our gradient is pushed in the opposite direction

Page 16: Dueling DQN & Prioritised Experience Replay

Dueling DQN

How can we decompose Q^π(s, a)?

Q^π(s, a) = V^π(s) + A^π(s, a)

V^π(s) = E_{a∼π(s)}[Q^π(s, a)]

In Dueling DQN, we separate the estimators of these two elements, using two new streams:

one estimates the state value V (s)

one estimates the advantage for each action A(s, a)

The network separately computes the advantage and value functions, and combines them back into a single Q-function at the final layer.
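As a small numerical check of this decomposition (the numbers are made up for illustration): under a uniform policy, V^π(s) is the mean of Q^π(s, ·) and the advantages are the residuals, which average to zero under the policy.

```python
import numpy as np

q = np.array([1.0, 2.0, 6.0])          # hypothetical Q^π(s, a) for three actions
pi = np.array([1/3, 1/3, 1/3])         # a uniform policy π(a|s)
v = np.dot(pi, q)                      # V^π(s) = E_{a∼π}[Q^π(s, a)] = 3.0
adv = q - v                            # A^π(s, a) = Q^π(s, a) − V^π(s) = [-2, -1, 3]
assert np.isclose(np.dot(pi, adv), 0)  # advantages have zero mean under π
```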

Page 20: Dueling DQN & Prioritised Experience Replay

Dueling DQN

Page 21: Dueling DQN & Prioritised Experience Replay

Dueling DQN

One stream of fully-connected layers outputs a scalar V(s; θ, β)

The other stream outputs an |A|-dimensional vector A(s, a; θ, α)

Here, θ denotes the parameters of the convolutional layers, while α and β are the parameters of the two streams of fully-connected layers.

Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)

Page 22: Dueling DQN & Prioritised Experience Replay

Dueling DQN

Q(s, a; θ, α, β) = V (s; θ, β) + A(s, a; θ, α)

Problem: The equation is unidentifiable → given Q we cannot recover V and A uniquely → poor practical performance.

Solutions:

1 Force the advantage function estimator to have zero advantage at the chosen action

Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − max_{a′∈|A|} A(s, a′; θ, α))

a∗ = argmax_{a′∈A} Q(s, a′; θ, α, β) = argmax_{a′∈A} A(s, a′; θ, α)

Q(s, a∗; θ, α, β) = V(s; θ, β)

Page 26: Dueling DQN & Prioritised Experience Replay

Dueling DQN

Q(s, a; θ, α, β) = V(s; θ, β) + A(s, a; θ, α)

Problem: The equation is unidentifiable → given Q we cannot recover V and A uniquely → poor practical performance.

Solutions:

2 Replace the max operator with an average

Q(s, a; θ, α, β) = V(s; θ, β) + (A(s, a; θ, α) − (1/|A|) Σ_{a′} A(s, a′; θ, α))

It increases the stability of the optimization: the advantages only need to change as fast as the mean, instead of having to compensate for any change.
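A minimal PyTorch sketch of this architecture, assuming a flat observation vector and a linear trunk in place of the convolutional layers the slides describe (all layer sizes and names are placeholders):

```python
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Sketch of a dueling Q-network: shared trunk (θ), value stream (β), advantage stream (α)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())   # θ (conv layers in the Atari version)
        self.value = nn.Linear(hidden, 1)                                   # V(s; θ, β)
        self.advantage = nn.Linear(hidden, n_actions)                       # A(s, ·; θ, α)

    def forward(self, obs):
        h = self.trunk(obs)
        v = self.value(h)                          # shape (batch, 1)
        a = self.advantage(h)                      # shape (batch, |A|)
        # mean-subtracted aggregator from this slide; the max-based variant from the
        # previous slide would use a.max(dim=1, keepdim=True).values instead
        return v + a - a.mean(dim=1, keepdim=True)
```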

Page 27: Dueling DQN & Prioritised Experience Replay

Dueling DQN: Example

Value and advantage saliency maps for two different timesteps

Leftmost pair: the value network stream pays attention to the road and the score. The advantage stream does not pay much attention to the visual input because its action choice is practically irrelevant when there are no cars in front. Rightmost pair: the advantage stream pays attention as there is a car immediately in front, making its choice of action very relevant.

Page 28: Dueling DQN & Prioritised Experience Replay

Dueling DQN: Summary

Intuitively, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action for each state.

The dueling architecture represents both the value V(s) and advantage A(s, a) functions with a single deep model whose output combines the two to produce a state-action value Q(s, a).

Page 29: Dueling DQN & Prioritised Experience Replay

Table of Contents

1 Recap: DQN

2 Recap: Double DQN

3 Dueling DQN

4 Prioritized Experience Replay (PER)

Page 30: Dueling DQN & Prioritised Experience Replay

Recap: Experience replay

Problem: Online RL agents incrementally update their parameters while they observe a stream of experience. In their simplest form, they discard incoming data immediately, after a single update. Two issues are:

1 Strongly correlated updates that break the i.i.d. assumption

2 Rapid forgetting of possibly rare experiences that would be useful later on.

Solution: Experience replay

Break the temporal correlations by mixing more and less recent experience for the updates

Rare experience will be used for more than just a single update
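A minimal sketch of a uniform replay buffer of the kind the slide describes (capacity and API names are placeholders):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and samples them uniformly at random, breaking temporal correlations."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform sampling without replacement; requires batch_size <= len(self.buffer)
        return random.sample(self.buffer, batch_size)
```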

Page 32: Dueling DQN & Prioritised Experience Replay

Prioritized Experience Replay (PER)

Two design choices:

1 Which experiences to store?

2 Which experiences to replay?

PER tries to solve the second question: which experiences to replay.

Page 34: Dueling DQN & Prioritised Experience Replay

PER: Example ‘Blind Cliffwalk’

Two actions: ‘right’ and ‘wrong’

The episode is terminated when the ‘wrong’ action is chosen.

Taking the ‘right’ action progresses through a sequence of n states, at the end of which lies a final reward of 1; the reward is 0 elsewhere.
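A hedged sketch of this toy environment; details the slide does not specify, such as which of the two action ids counts as ‘right’ in each state, are assumptions:

```python
import random

class BlindCliffwalk:
    """Minimal sketch of the 'Blind Cliffwalk' example: n states, two actions,
    the 'wrong' action terminates with reward 0, reaching the end yields reward 1."""
    def __init__(self, n, seed=0):
        self.n = n
        rng = random.Random(seed)
        self.right_action = [rng.randint(0, 1) for _ in range(n)]  # assumed per-state randomization
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        if action != self.right_action[self.state]:
            return self.state, 0.0, True       # 'wrong': episode terminates, no reward
        self.state += 1
        if self.state == self.n:
            return self.state, 1.0, True       # reached the end: final reward of 1
        return self.state, 0.0, False          # 'right': progress, reward 0
```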

Page 35: Dueling DQN & Prioritised Experience Replay

Prioritized Experience Replay (PER): TD error

TD error for vanilla DQN:

δ_i = r_t + γ max_{a∈A} Q_{θ⁻}(s_{t+1}, a) − Q_θ(s_t, a_t)

TD error for Double DQN:

δ_i = r_t + γ Q_{θ⁻}(s_{t+1}, argmax_{a∈A} Q_θ(s_{t+1}, a)) − Q_θ(s_t, a_t)

We use |δ_i| as the magnitude of the TD error.

What does |δ_i| tell us?

A big difference between our prediction and the TD target → we have a lot to learn from that transition.
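As a hedged sketch (network and tensor names are assumptions), the per-transition |δ_i| that PER uses as its priority signal can be computed like this, here with the Double DQN target:

```python
import torch

def abs_td_error(online_net, target_net, batch, gamma=0.99):
    """|δ_i| per transition, used as the priority signal in the slides that follow."""
    s, a, r, s_next, done = batch
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)         # Q_θ(s_t, a_t)
    with torch.no_grad():
        a_star = online_net(s_next).argmax(dim=1, keepdim=True)       # argmax_a Q_θ(s_{t+1}, a)
        target = r + gamma * (1 - done) * target_net(s_next).gather(1, a_star).squeeze(1)
    return (target - q_sa).abs().detach()                             # |δ_i|
```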

Page 37: Dueling DQN & Prioritised Experience Replay

Prioritized Experience Replay (PER)

Two ways of getting priorities, denoted as p_i:

1 Direct, proportional prioritization:

p_i = |δ_i| + ε

where ε is a small constant ensuring that the sample has some non-zero probability of being drawn

2 A rank-based method:

p_i = 1 / rank(i)

where rank(i) is the rank of transition i when the replay memory is sorted according to |δ_i|
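A hedged sketch of both prioritization schemes (function and array names are assumptions):

```python
import numpy as np

def proportional_priorities(abs_td_errors, eps=1e-6):
    # p_i = |δ_i| + ε : every transition keeps a non-zero priority
    return np.abs(abs_td_errors) + eps

def rank_based_priorities(abs_td_errors):
    # p_i = 1 / rank(i), where rank 1 is the transition with the largest |δ_i|
    order = np.argsort(-np.abs(abs_td_errors))      # indices from largest to smallest |δ_i|
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)     # rank of each transition
    return 1.0 / ranks
```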

Page 39: Dueling DQN & Prioritised Experience Replay

PER: Stochastic sampling method

Problem: During exploration, the p_i terms are not known for brand-new samples.

Solution: Interpolate between pure greedy prioritization and uniform random sampling. The probability of sampling transition i is

P(i) = p_i^α / Σ_k p_k^α

where p_i > 0 is the priority of transition i; α is the level of prioritization.

If α → 0, there is no prioritization, because all p_i^α = 1 (the uniform case)

If α → 1, we get full prioritization, where sampling is more heavily dependent on the actual |δ_i| values.

This ensures that the probability of being sampled is monotonic in a transition's priority, while guaranteeing a non-zero probability even for the lowest-priority transition.
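A hedged sketch of this sampling scheme in plain NumPy, with the buffer's priorities held in a flat array (a practical implementation would typically use a sum-tree for efficiency; the default α is a placeholder):

```python
import numpy as np

def sample_indices(priorities, batch_size, alpha=0.6, rng=None):
    """Sample transition indices with P(i) = p_i^α / Σ_k p_k^α."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(priorities, dtype=np.float64) ** alpha   # α=0: uniform, α=1: full prioritization
    probs = scaled / scaled.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)       # every transition keeps a non-zero chance
    return idx, probs[idx]                                       # probabilities are reused for the IS weights
```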

Page 43: Dueling DQN & Prioritised Experience Replay

PER: Importance-sampling (IS) weights

Use importance-sampling weights to adjust the update, reducing the weights of samples that are seen often.

w_i = (1/N · 1/P(i))^β

where β is the exponent that controls how much of the importance-sampling correction is applied (β = 1 fully compensates for the non-uniform sampling probabilities).

For stability reasons, we always normalize weights by 1/max_i w_i so that they only scale the update downwards.
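A hedged sketch of the weight computation, continuing the assumed NumPy names from the sampling sketch above (the default β is a placeholder):

```python
import numpy as np

def importance_weights(sample_probs, buffer_size, beta=0.4):
    # w_i = (1/N · 1/P(i))^β, normalized by max_i w_i so weights only scale updates downwards
    w = (1.0 / (buffer_size * np.asarray(sample_probs))) ** beta
    return w / w.max()
```

In training, each sampled transition's TD loss would be multiplied by its weight before the gradient step.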

Page 44: Dueling DQN & Prioritised Experience Replay

PER: Double DQN algorithm with proportional prioritization

Page 45: Dueling DQN & Prioritised Experience Replay

Prioritized Experience Replay (PER): Summary

Built on top of experience replay buffers

Uniform sampling from a replay buffer is a good default strategy, but it can be improved by prioritized sampling, which weights the samples so that "important" ones are drawn more frequently for training.

The key idea is to increase the replay probability of experience tuples that have a high expected learning progress (measured by |δ|). This leads both to faster learning and to a better final policy quality, as compared to uniform experience replay.
