Page 1:

Lecture 7: DQN

Reinforcement Learning with TensorFlow & OpenAI Gym
Sung Kim <[email protected]>

Page 2:

Q-function Approximation: Q-Nets

(1) state, s

(2) quality (reward) for all actions (e.g., [0.5, 0.1, 0.0, 0.8] means LEFT: 0.5, RIGHT: 0.1, UP: 0.0, DOWN: 0.8)

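A minimal sketch of such a Q-net, assuming a 16-state, 4-action FrozenLake-style problem (this is my own illustration with the tf.keras API, not the lecture's lab code):

    import numpy as np
    import tensorflow as tf

    n_states, n_actions = 16, 4   # assumed FrozenLake-style sizes

    # (1) input: the state, one-hot encoded
    # (2) output: one Q-value ("quality") per action, e.g. [0.5, 0.1, 0.0, 0.8]
    q_net = tf.keras.Sequential([
        tf.keras.Input(shape=(n_states,)),
        tf.keras.layers.Dense(n_actions),
    ])

    state = np.eye(n_states)[[3]]                    # one-hot row vector for state 3
    q_values = q_net.predict(state, verbose=0)[0]    # Q-values for LEFT, RIGHT, UP, DOWN
    action = int(np.argmax(q_values))                # greedy action: index of the largest Q-value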

Page 3:

Q-Nets are unstable

Page 4:

Convergence

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

$$\min_{\theta} \sum_{t=0}^{T} \Big[ \hat{Q}(s_t, a_t \mid \theta) - \big( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \big) \Big]^2$$
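Each term of this sum is a squared TD error. A small helper showing one such term (my own sketch; the discount factor value is assumed):

    import numpy as np

    gamma = 0.99   # assumed discount factor

    def squared_td_error(q_sa, reward, q_next, done):
        """One term of the sum: (Q_hat(s_t, a_t) - (r_t + gamma * max_a' Q_hat(s_{t+1}, a')))^2."""
        target = reward if done else reward + gamma * np.max(q_next)  # target is just r_t at terminal states
        return (q_sa - target) ** 2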

Page 5:

Reinforcement + Neural Net

http://stackoverflow.com/questions/10722064/training-a-neural-network-with-reinforcement-learning

Page 6:

[Nature cover, 26 February 2015, Vol. 518, No. 7540: "Self-taught AI software attains human-level performance in video games", pages 486 & 529]

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

Page 7:

Two big issues

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

Page 8:

1. Correlations between samples

Playing Atari with Deep Reinforcement Learning - University of Toronto by V Mnih et al.

Page 9:

1. Correlations between samples

Playing Atari with Deep Reinforcement Learning - University of Toronto by V Mnih et al.

Page 10:

1. Correlations between samples

Page 11:

Prerequisite: http://hunkim.github.io/ml/ or https://www.inflearn.com/course/기본적인-머신러닝-딥러닝-강좌/

Page 12:

1. Correlations between samples

Page 13:

2. Non-stationary targets

$$\min_{\theta} \sum_{t=0}^{T} \Big[ \hat{Q}(s_t, a_t \mid \theta) - \big( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \big) \Big]^2$$

Page 14:

2. Non-stationary targets

$$\min_{\theta} \sum_{t=0}^{T} \Big[ \hat{Q}(s_t, a_t \mid \theta) - \big( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \big) \Big]^2$$

$$\hat{Y} = \hat{Q}(s_t, a_t \mid \theta), \qquad Y = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta)$$

Both the prediction $\hat{Y}$ and the target $Y$ depend on the same parameters $\theta$, so every update that moves the prediction also moves the target it is chasing.

Page 15:

DQN’s three solutions

1. Go deep

2. Capture and replay (for correlations between samples)

3. Separate networks: create a target network (for non-stationary targets)

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

Page 16:

Human-level control through deep reinforcement learning, Nature http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

Solution 1: go deep

Page 17:

Solution 1: go deep

ICML 2016 Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

Page 18:

Problem 2: correlations between samples

Page 19:

Solution 2: experience replay
Deep Q-Networks (DQN): Experience Replay

To remove correlations, build data-set from agent’s own experience

$$s_1, a_1, r_2, s_2;\quad s_2, a_2, r_3, s_3;\quad s_3, a_3, r_4, s_4;\;\dots;\quad s_t, a_t, r_{t+1}, s_{t+1} \;\longrightarrow\; s, a, r, s'$$

Sample experiences from data-set and apply update

$$l = \Big( r + \gamma \max_{a'} Q(s', a', w^{-}) - Q(s, a, w) \Big)^2$$

To deal with non-stationarity, target parameters $w^{-}$ are held fixed

Capture random sample & Replay:

$$\min_{\theta} \sum_{t=0}^{T} \Big[ \hat{Q}(s_t, a_t \mid \theta) - \big( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \big) \Big]^2$$

ICML 2016 Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind
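A minimal capture-and-replay buffer sketch (illustrative only, not the lecture's lab code; capacity and batch size are assumed values):

    import random
    from collections import deque

    REPLAY_MEMORY = 50_000   # assumed buffer capacity
    BATCH_SIZE = 64          # assumed minibatch size

    replay_buffer = deque(maxlen=REPLAY_MEMORY)

    def capture(state, action, reward, next_state, done):
        """Store one transition (s, a, r, s', done) from the agent's own experience."""
        replay_buffer.append((state, action, reward, next_state, done))

    def replay_minibatch():
        """Draw a random minibatch, breaking up correlations between consecutive samples."""
        return random.sample(replay_buffer, BATCH_SIZE)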

Page 20:

Solution 2: experience replay

Playing Atari with Deep Reinforcement Learning - University of Toronto by V Mnih et al.

Page 21:

Problem 2: correlations between samples
Deep Q-Networks (DQN): Experience Replay

To remove correlations, build data-set from agent’s own experience

$$s_1, a_1, r_2, s_2;\quad s_2, a_2, r_3, s_3;\quad s_3, a_3, r_4, s_4;\;\dots;\quad s_t, a_t, r_{t+1}, s_{t+1} \;\longrightarrow\; s, a, r, s'$$

Sample experiences from data-set and apply update

$$l = \Big( r + \gamma \max_{a'} Q(s', a', w^{-}) - Q(s, a, w) \Big)^2$$

To deal with non-stationarity, target parameters $w^{-}$ are held fixed

Page 22:

Problem 3: non-stationary targets

$$\min_{\theta} \sum_{t=0}^{T} \Big[ \hat{Q}(s_t, a_t \mid \theta) - \big( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \big) \Big]^2$$

$$\hat{Y} = \hat{Q}(s_t, a_t \mid \theta), \qquad Y = r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta)$$

Page 23:

Human-level control through deep reinforcement learning, Nature http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

Solution 3: separate target network

Original loss, where the target is computed with the same parameters $\theta$:

$$\min_{\theta} \sum_{t=0}^{T} \Big[ \hat{Q}(s_t, a_t \mid \theta) - \big( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \theta) \big) \Big]^2$$

With a separate target network, the bootstrap term uses frozen parameters $\bar{\theta}$ instead:

$$\min_{\theta} \sum_{t=0}^{T} \Big[ \hat{Q}(s_t, a_t \mid \theta) - \big( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \bar{\theta}) \big) \Big]^2$$
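Sketched in code (my own illustration, assuming target_net is a separate copy of the Q-net holding the frozen parameters θ̄), the only change is which network computes the bootstrap term:

    import numpy as np

    gamma = 0.99   # assumed discount factor

    def dqn_target(reward, next_state, done, target_net):
        """y = r_t + gamma * max_a' Q_hat(s_{t+1}, a' | theta_bar), computed with the frozen target_net."""
        if done:
            return reward
        q_next = target_net.predict(next_state[np.newaxis], verbose=0)[0]   # theta_bar, not theta
        return reward + gamma * np.max(q_next)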

Page 24:

Solution 3: separate target network

$$\min_{\theta} \sum_{t=0}^{T} \Big[ \hat{Q}(s_t, a_t \mid \theta) - \big( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a' \mid \bar{\theta}) \big) \Big]^2$$

[Diagram: two networks take the state $s$ as input; network (1) produces the prediction $\hat{Q}(s)$ and a separate network (2) produces $Y$ (target)]

Page 25:

Solution 3: copy network

[Diagram: two networks take the state $s$ as input; network (1) produces the prediction $\hat{Q}(s)$ and a separate network (2) produces $Y$ (target)]
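A sketch of the copy step (illustrative; the update interval is an assumed value): every so many training steps the online network's weights θ are copied into the target network, which then stays fixed until the next copy.

    TARGET_UPDATE_EVERY = 1000   # assumed copy interval, in training steps

    def copy_network(q_net, target_net):
        """theta_bar <- theta: overwrite the target network with the online network's weights (tf.keras models)."""
        target_net.set_weights(q_net.get_weights())

    # inside the training loop:
    # if step % TARGET_UPDATE_EVERY == 0:
    #     copy_network(q_net, target_net)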

Page 26:

Understanding the Nature Paper (2015)

Human-level control through deep reinforcement learning, Nature http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html

Page 27:

Next

Lab: DQN

