Page 1:

Prediction, Control and Decisions
Kenji Doya
doya@irp.oist.jp

Initial Research Project, OIST
ATR Computational Neuroscience Laboratories
CREST, Japan Science and Technology Agency
Nara Institute of Science and Technology

Page 2:

Outline

Introduction

Cerebellum, basal ganglia, and cortex

Meta-learning and neuromodulators

Prediction time scale and serotonin

Page 3:

Learning to Walk (Doya & Nakano, 1985)

Action: cycle of 4 postures
Reward: speed sensor output

Multiple solutions: creeping, jumping, …

[Video placeholder: QuickTime H.263 movie]

Page 4:

Learning to Stand Up (Morimoto & Doya, 2001)

[Video placeholders: two QuickTime Cinepak movies (early trials, after learning)]

Reward: height of the head
No desired trajectory

Page 5:

Framework for learning state-action mapping (policy) by exploration and reward feedback

Critic: reward prediction

Actor: action selection

Learning
  external reward r
  internal reward δ: difference from prediction

Reinforcement Learning (RL)

[Diagram: agent (critic and actor) acts on the environment with action a, and receives state s and reward r]

Page 6:

Reinforcement Learning Methods

Model-free methods
  Episode-based: parameterize policy P(a|s; θ)
  Temporal difference: state value function V(s), (state-)action value function Q(s,a)

Model-based methods
  Dynamic Programming: forward model P(s’|s,a)

Page 7:

Temporal Difference Learning

Predict reward: value function
  V(s) = E[ r(t) + γ r(t+1) + γ² r(t+2) + … | s(t)=s ]
  Q(s,a) = E[ r(t) + γ r(t+1) + γ² r(t+2) + … | s(t)=s, a(t)=a ]

Select action
  greedy: a = argmaxₐ Q(s,a)
  Boltzmann: P(a|s) ∝ exp[ β Q(s,a) ]

Update prediction: TD error
  δ(t) = r(t) + γ V(s(t+1)) − V(s(t))
  ΔV(s(t)) = α δ(t)
  ΔQ(s(t),a(t)) = α δ(t)
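A minimal sketch, in Python, of the three pieces on this slide: Boltzmann action selection, the TD error δ, and the tabular updates. The chain environment and the parameter values are illustrative assumptions, not from the talk.

```python
# Tabular TD/Q-learning with Boltzmann action selection (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
gamma, alpha, beta = 0.9, 0.1, 2.0   # discount, learning rate, inverse temperature
Q = np.zeros((n_states, n_actions))

def boltzmann(q_row, beta):
    """P(a|s) proportional to exp(beta * Q(s,a))."""
    p = np.exp(beta * (q_row - q_row.max()))   # subtract max for numerical stability
    return p / p.sum()

def step(s, a):
    """Assumed toy chain: action 1 moves right, reward on reaching the last state."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

for episode in range(200):
    s = 0
    for t in range(20):
        a = rng.choice(n_actions, p=boltzmann(Q[s], beta))
        s_next, r = step(s, a)
        # TD error: delta(t) = r(t) + gamma V(s(t+1)) - Q(s(t),a(t)), with V(s) = max_a Q(s,a)
        delta = r + gamma * Q[s_next].max() - Q[s, a]
        Q[s, a] += alpha * delta             # dQ(s,a) = alpha * delta
        s = s_next
```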

Page 8:

Dynamic Programming and RL

Dynamic Programming: model-based, off-line
  solve the Bellman equation:
  V(s) = maxₐ Σ_s’ P(s’|s,a) [ r(s,a,s’) + γ V(s’) ]

Reinforcement Learning: model-free, on-line
  learn by TD error:
  δ(t) = r(t) + γ V(s(t+1)) − V(s(t))
  ΔV(s(t)) = α δ(t)
  ΔQ(s(t),a(t)) = α δ(t)
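The contrast can be made concrete: with a known model, value iteration sweeps the Bellman backup until convergence. A sketch under an assumed toy model (uniform transitions, reward for landing in the last state):

```python
# Value iteration: sweep the Bellman backup over a known model until convergence.
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)   # P[s,a,s'] (toy: uniform)
R = np.zeros((n_states, n_actions, n_states))
R[:, :, n_states - 1] = 1.0            # reward for landing in the last state

V = np.zeros(n_states)
for _ in range(1000):
    # V(s) = max_a sum_s' P(s'|s,a) [ r(s,a,s') + gamma V(s') ]
    V_new = np.max(np.sum(P * (R + gamma * V), axis=2), axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # converged
        break
    V = V_new
```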

Page 9:

Discrete vs. Continuous RL(Doya, 2000)

Discrete time
  V(x) = E[ r(t) + γ r(t+Δt) + γ² r(t+2Δt) + … ]
  δ(t) = r(t) + γ V(t+Δt) − V(t)

Continuous time
  V(x) = ∫ₜ^∞ e^−(s−t)/τ r(s) ds
  δ(t) = r(t) + V̇(t) − V(t)/τ

Correspondence: τ = Δt / (1−γ), i.e., γ = 1 − Δt/τ
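The stated correspondence between the discount factor and the time constant can be checked numerically; the Δt and τ below are arbitrary illustrative values.

```python
# Numerical check of tau = dt / (1 - gamma); dt and tau values are arbitrary.
dt, tau = 0.1, 2.0
gamma = 1.0 - dt / tau              # discrete discount matching time constant tau
print(gamma)                        # 0.95
print(dt / (1.0 - gamma))           # 2.0, recovering tau
```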

Page 10:

Questions

Computational questions
  How to learn:
    direct policy P(a|s)
    value functions V(s), Q(s,a)
    forward models P(s’|s,a)
  When to use which method?

Biological questions
  Where in the brain?
  How are they represented/updated?
  How are they selected/coordinated?

Page 11:

Brain Hierarchy

Forebrain
  Cerebral cortex (a)
    neocortex
    paleocortex: olfactory cortex
    archicortex: basal forebrain, hippocampus
  Basal nuclei (b)
    neostriatum: caudate, putamen
    paleostriatum: globus pallidus
    archistriatum: amygdala
  Diencephalon
    thalamus (c)
    hypothalamus (d)

Brain stem & Cerebellum
  Midbrain (e)
  Hindbrain
    pons (f)
    cerebellum (g)
    medulla (h)

Spinal cord (i)

Page 12:

Just for Motor Control? (Middleton & Strick, 1994)

Basal ganglia (Globus Pallidus)

Prefrontal cortex (area 46)

Cerebellum (dentate nucleus)

Page 13:

Specialization by Learning Algorithms (Doya, 1999)

[Diagram: loops between cerebral cortex, basal ganglia, and cerebellum through the thalamus, with SN (substantia nigra) and IO (inferior olive) supplying training signals]

Cerebellum: supervised learning (input → output, trained by error = target − output)
Basal Ganglia: reinforcement learning (input → output, trained by reward)
Cerebral Cortex: unsupervised learning (input → output)

Page 14:

Cerebellum

Purkinje cells
  ~10⁵ parallel fibers
  single climbing fiber
  long-term depression

Supervised learning
  perceptron hypothesis
  internal models

Page 15:

[fMRI activation maps: early learning vs. after learning]

Internal Models in the Cerebellum

(Imamizu et al., 2000)

Learning to use ‘rotated’ mouse

Page 16:

Motor Imagery (Luft et al. 1998)

[Panels: finger movement vs. imagery of movement]

Page 17:

Basal Ganglia

Striatum
  striosome & matrix
  dopamine-dependent plasticity

Dopamine neurons
  reward-predictive response

TD learning

Page 18:

[Figure: dopamine neuron activity, reward r, and reward prediction V for (a) before learning, (b) after learning, (c) no reward]

Dopamine Neurons and TD Error

δ(t) = r(t) + γ V(s(t+1)) − V(s(t))

before learning

after learning

omit reward

(Schultz et al. 1997)
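The three panels can be reproduced qualitatively with a tabular TD model; a sketch assuming a 20-step trial with a cue at step 5 and reward at step 15 (the timing, α, and time-indexed state representation are assumptions):

```python
# Qualitative reproduction of the three panels with a tabular TD model.
import numpy as np

T, cue, reward_time = 20, 5, 15      # assumed trial timing
gamma, alpha = 1.0, 0.1
V = np.zeros(T + 1)                  # V indexed by time within the trial

def run_trial(V, reward_given=True, learn=True):
    delta = np.zeros(T)
    for t in range(T):
        r = 1.0 if (t == reward_time and reward_given) else 0.0
        delta[t] = r + gamma * V[t + 1] - V[t]
        if learn and t >= cue:       # value can only accrue once the cue is visible
            V[t] += alpha * delta[t]
    return delta

before = run_trial(V, learn=False)   # delta peaks at the reward (V still zero)
for _ in range(500):
    run_trial(V)                     # train
after = run_trial(V, learn=False)    # delta now appears at cue onset
omitted = run_trial(V, reward_given=False, learn=False)   # negative dip at the omitted reward
```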

Page 19:

Reward-predicting Activities of Striatal Neurons

Delayed saccade task (Kawagoe et al., 1998)

Not just actions, but resulting rewards

[Raster plots: saccade target (Right, Up, Left, Down) × rewarded direction (Right, Up, Left, Down, All)]

Page 20:

Cerebral Cortex

Recurrent connections
Hebbian plasticity

Unsupervised learning, e.g., PCA, ICA

Page 21:

Replicating V1 Receptive Fields

(Olshausen & Field, 1996)

Infomax and sparseness
Hebbian plasticity and recurrent inhibition

Page 22:

Specialization by Learning?

Cerebellum: supervised learning
  error signal by climbing fibers
  forward model s’ = f(s,a) and policy a = g(s)

Basal ganglia: reinforcement learning
  reward signal by dopamine fibers
  value functions V(s) and Q(s,a)

Cerebral cortex: unsupervised learning
  Hebbian plasticity and recurrent inhibition
  representation of state s and action a

But how are they recruited and combined?
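For concreteness, here is a one-neuron sketch of the three plasticity rules above: the delta rule, a dopamine-gated TD update, and Oja's rule as one standard Hebbian example. The inputs, target, TD error value, and rates are made up.

```python
# One-neuron sketches of the three plasticity rules; x, target, and rates are made up.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=8)               # presynaptic input
w = rng.normal(size=8) * 0.1         # synaptic weights
eta = 0.05                           # learning rate

# Cerebellum, supervised: delta rule, error signal carried by the climbing fiber
y = w @ x
target = 1.0
w += eta * (target - y) * x

# Basal ganglia, reinforcement: weight change gated by the TD error delta
delta_td = 0.5                       # assumed reward prediction error
w += eta * delta_td * x

# Cortex, unsupervised: Hebbian learning with decay (Oja's rule), no teacher;
# over many inputs this moves w toward the principal component
y = w @ x
w += eta * y * (x - y * w)
```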

Page 23:

Multiple Action Selection Schemes

Model-free
  a = argmaxₐ Q(s,a)

Model-based
  a = argmaxₐ [ r + γ V(f(s,a)) ]
  forward model: f(s,a)

Encapsulation
  a = g(s)

[Diagrams: s,a → Q; s,a → f → s’ → V; s → g → a]
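A sketch of the three selection schemes side by side, with assumed toy tables for Q, V, the forward model f, the reward model r, and the cached policy g:

```python
# The three action selection schemes; Q, V, f, r, and g are assumed toy tables.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 4, 2
gamma = 0.9
Q = rng.normal(size=(n_states, n_actions))               # learned action values
V = rng.normal(size=n_states)                            # learned state values
f = rng.integers(n_states, size=(n_states, n_actions))   # forward model s' = f(s,a)
r = np.zeros((n_states, n_actions))                      # immediate reward model
g = Q.argmax(axis=1)                                     # cached (encapsulated) policy

def act_model_free(s):
    return int(np.argmax(Q[s]))                          # a = argmax_a Q(s,a)

def act_model_based(s):
    return int(np.argmax(r[s] + gamma * V[f[s]]))        # a = argmax_a [r + gamma V(f(s,a))]

def act_encapsulated(s):
    return int(g[s])                                     # a = g(s)
```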

Page 24:

Lectures at OCNC 2005

Internal models / Cerebellum
  Reza Shadmehr
  Stefan Schaal
  Mitsuo Kawato

Reward / Basal ganglia
  Andrew G. Barto
  Bernard Balleine
  Peter Dayan
  John O’Doherty
  Minoru Kimura
  Wolfram Schultz

State coding / Cortex
  Nathaniel Daw
  Leo Sugrue
  Daeyeol Lee
  Jun Tanji
  Anitha Pasupathy
  Masamichi Sakagami

Page 25:

Outline

Introduction

Cerebellum, basal ganglia, and cortex

Meta-learning and neuromodulators

Prediction time scale and serotonin

Page 26:

Framework for learning state-action mapping (policy) by exploration and reward feedback

Critic: reward prediction

Actor: action selection

Learning
  external reward r
  internal reward δ: difference from prediction

Reinforcement Learning (RL)

[Diagram: agent (critic and actor) acts on the environment with action a, and receives state s and reward r]

Page 27:

Reinforcement Learning

Predict reward: value function
  V(s) = E[ r(t) + γ r(t+1) + γ² r(t+2) + … | s(t)=s ]
  Q(s,a) = E[ r(t) + γ r(t+1) + γ² r(t+2) + … | s(t)=s, a(t)=a ]

Select action
  greedy: a = argmaxₐ Q(s,a)
  Boltzmann: P(a|s) ∝ exp[ β Q(s,a) ]

Update prediction: TD error
  δ(t) = r(t) + γ V(s(t+1)) − V(s(t))
  ΔV(s(t)) = α δ(t)
  ΔQ(s(t),a(t)) = α δ(t)

Page 28:

Cyber Rodent Project

Robots with the same constraints as biological agents
  What is the origin of rewards?
  What is to be learned, what is to be evolved?

Self-preservation: capture batteries

Self-reproduction: exchange programs through IR ports

Page 29:

Cyber Rodent: Hardware

camera, range sensor, proximity sensors, gyro, battery latch, two wheels, IR port, speaker, microphones, R/G/B LED

Page 30:

Evolving Robot Colony

Survival: catch battery packs

Reproduction: copy ‘genes’ through IR ports

[Video placeholders: two QuickTime YUV420 movies]

Page 31:

Discounting Future Reward

large γ / small γ

[Video placeholders: two QuickTime DV/DVCPRO (NTSC) movies]

Page 32:

Setting of Reward Function

Reward: r = r_main + r_supp − r_cost

e.g., reward for vision of a battery

[Video placeholder: QuickTime DV/DVCPRO (NTSC) movie]

Page 33:

Reinforcement Learning of Reinforcement Learning (Schweighofer & Doya, 2003)

Fluctuations in the metaparameters correlate with average reward

reward

Page 34:

[Plot: inverse temperature β (0 to 14) as a function of battery level (0 to 1)]

Randomness Control by Battery Level

Greedier action at both extremes
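The observed U-shaped pattern could be captured by a schedule like the following; the quadratic form and the β range are assumptions for illustration, not the schedule the robots actually acquired.

```python
# Hypothetical U-shaped schedule: greedier (larger beta) at both battery extremes.
def inverse_temperature(battery_level, beta_min=2.0, beta_max=14.0):
    """Map battery level in [0, 1] to beta; the quadratic U-shape is an assumption."""
    u = 4.0 * (battery_level - 0.5) ** 2    # 0 at mid-level, 1 at the extremes
    return beta_min + (beta_max - beta_min) * u

for b in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(b, inverse_temperature(b))        # 14, 5, 2, 5, 14
```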

Page 35:

Neuromodulators for Metalearning

(Doya, 2002)

Metaparameter tuning is critical in RL. How does the brain tune them?

Dopamine: TD error δ
Acetylcholine: learning rate α
Noradrenaline: inverse temperature β
Serotonin: discount factor γ

Page 36:

Learning Rate

ΔV(s(t−1)) = α δ(t)
ΔQ(s(t−1),a(t−1)) = α δ(t)

small α: slow learning
large α: unstable learning

Acetylcholine (basal forebrain)
  Regulates memory update and retention (Hasselmo et al.)
  LTP in cortex and hippocampus
  top-down vs. bottom-up information flow

Page 37:

Inverse Temperature

Greediness in action selection

P(aᵢ|s) ∝ exp[ β Q(s,aᵢ) ]

small β: exploration
large β: exploitation

Noradrenaline (locus coeruleus)
  Correlation with performance accuracy (Aston-Jones et al.)
  Modulation of cellular I/O gain (Cohen et al.)

[Plot: P(a₁) as a function of Q(s,a₁) − Q(s,a₂) from −4 to 4, for β = 0, 1, 10]
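The curves in that plot follow directly from the Boltzmann rule: for two actions it reduces to a logistic function of the value difference. A small sketch:

```python
# For two actions the Boltzmann rule reduces to a logistic in the value difference.
import numpy as np

def p_a1(q_diff, beta):
    """P(a1) = 1 / (1 + exp(-beta * (Q(s,a1) - Q(s,a2))))."""
    return 1.0 / (1.0 + np.exp(-beta * q_diff))

q_diff = np.linspace(-4, 4, 9)
for beta in (0.0, 1.0, 10.0):
    print(beta, np.round(p_a1(q_diff, beta), 3))
# beta = 0: always 0.5 (random); large beta: near step function (greedy)
```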

Page 38:

Discount Factor

Serotonin (dorsal raphe)
  Low activity associated with impulsivity
  depression, bipolar disorder
  aggression, eating disorders

[Plots: reward sequence over 10 time steps and the resulting value V; γ = 0.5 gives V = −0.093, γ = 0.9 gives V = +0.062]

V(s(t)) = E[ r(t+1) + γ r(t+2) + γ² r(t+3) + … ]
Balance between short- and long-term results
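A quick worked example of this balance: with an early cost and a late payoff, the sign of V can flip with γ. The reward sequence below is an assumed example, not the one plotted above.

```python
# The sign of V flips with gamma when an early cost precedes a late payoff.
def discounted_value(rewards, gamma):
    """V = sum_k gamma^k r(t+k)."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [0, -0.4, 0, 0, 0, 0, 0, 0, 0, 1.0]
print(discounted_value(rewards, 0.5))   # negative: small gamma, the early cost dominates
print(discounted_value(rewards, 0.9))   # positive: large gamma, the late reward dominates
```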

Page 39:

TD Error

δ(t) = r(t) + γ V(s(t)) − V(s(t−1))

Global learning signal
  reward prediction: ΔV(s(t−1)) = α δ(t)
  reinforcement: ΔQ(s(t−1),a(t−1)) = α δ(t)

Dopamine (substantia nigra, VTA)
  Responds to errors in reward prediction
  Reinforcement of actions
  addiction

Page 40:

TD Model of Basal Ganglia (Houk et al. 1995; Montague et al. 1996; Schultz et al. 1997; …)

Striosome: state value V(s)
Matrix: action value Q(s,a)

[Diagram: sensory input → cerebral cortex (state representation s) → striatum (evaluation: V(s); action selection: Q(s,a)) → SNr/GP → thalamus → action output; dopamine neurons carry the TD signal δ(t), driven by reward r; SNr/GPi select action a from Q(s,a); candidate modulator roles marked NA?, ACh?, 5-HT?]

Page 41:

Possible Control of Discount Factor

Modulation of TD error

Selection/weighting of parallel networks

[Diagram: striatum holds parallel value networks V1, V2, V3 with discount factors γ1, γ2, γ3; dopamine neurons compute δ(t) from V(s(t)) and V(s(t+1))]

δ(t) = r(t) + γ V(s(t+1)) − V(s(t))
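A sketch of the second scheme: parallel predictions Vᵢ learned with different γᵢ, combined by a weighting w (toy numbers assumed; the slide raises the possibility that serotonin sets the weighting).

```python
# Parallel value predictions V_i with different gamma_i, combined by a weighting w.
import numpy as np

gammas = np.array([0.3, 0.9, 0.99])          # gamma_1, gamma_2, gamma_3
V = np.array([0.1, 0.5, 0.8])                # assumed predictions V_1(s), V_2(s), V_3(s)

def combined_value(V, w):
    """Effective value as a weighted sum of the parallel predictions."""
    return float(np.dot(w, V))

print(combined_value(V, np.array([1.0, 0.0, 0.0])))   # short-sighted weighting
print(combined_value(V, np.array([0.0, 0.0, 1.0])))   # far-sighted weighting
```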

Page 42:

Markov Decision Task (Tanaka et al., 2004)

State transition and reward functions

Stimulus and response

Page 43:

Behavioral Results

All subjects successfully learned optimal behavior

Page 44:

Block-Design Analysis

SHORT vs. NO (p < 0.001 uncorrected): OFC, insula, striatum, cerebellum

LONG vs. SHORT (p < 0.0001 uncorrected): cerebellum, striatum, dorsal raphe, DLPFC, VLPFC, IPC, PMd

Different brain areas involved in immediate and future reward prediction

Page 45:

Ventro-Dorsal Difference

Lateral PFC Insula Striatum

Page 46:

 

Estimate V(t) and δ(t) from subjects’ performance data
Regression analysis of fMRI data

Model-based Regressor Analysis

[Diagram: agent (policy; value function V(s); TD error δ(t)) interacts with the environment through state s(t), action a(t), and reward r(t) (20 yen); the estimated V(t) and δ(t) serve as regressors for the fMRI data]
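A sketch of the regressor construction: compute V(t) and δ(t) from the trial-by-trial reward sequence for several γ, then ask which regressor best correlates with a measured signal. The single-state TD learner and the synthetic data are simplifying assumptions, not the fitted model of Tanaka et al.

```python
# Build V(t) and delta(t) regressors for several gammas; correlate with a signal.
import numpy as np

rng = np.random.default_rng(2)
T = 312                                              # trials, as in the slide
r = rng.binomial(1, 0.3, size=T).astype(float)       # placeholder reward sequence

def regressors(r, gamma, alpha=0.2):
    """Running TD prediction V(t) and TD error delta(t), one shared state."""
    V, delta, v = np.zeros(len(r)), np.zeros(len(r)), 0.0
    for t in range(len(r)):
        V[t] = v
        delta[t] = r[t] + gamma * v - v              # delta = r + gamma V - V
        v += alpha * delta[t]                        # dV = alpha * delta
    return V, delta

signal = regressors(r, 0.9)[0] + 0.1 * rng.normal(size=T)   # fake "fMRI" signal
for gamma in (0.0, 0.3, 0.6, 0.8, 0.9, 0.99):
    V, _ = regressors(r, gamma)
    print(gamma, round(float(np.corrcoef(V, signal)[0, 1]), 3))
```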

Page 47:

Explanatory Variables (subject NS)

Reward prediction V(t) for γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99

Reward prediction error δ(t) for γ = 0, 0.3, 0.6, 0.8, 0.9, 0.99

(trials 1 to 312)

Page 48:

Regression Analysis

Reward prediction V: mPFC (x = −2 mm), insula (x = −42 mm)

Reward prediction error δ: striatum (z = 2 mm)

Page 49:

Tryptophan Depletion/Loading

Tryptophan: precursor of serotonin
depletion/loading affects central serotonin levels (e.g., Bjork et al. 2001; Luciana et al. 2001)

100 g amino acid drink; experiments after 6 hours

Day 1: Tr− (depletion), no tryptophan
Day 2: Tr0 (control), 2.3 g of tryptophan
Day 3: Tr+ (loading), 10.3 g of tryptophan

Page 50:

Blood Tryptophan Levels

N.D. (< 3.9 µg/ml)

Page 51:

Delayed Reward Choice Task

Page 52:

Delayed Reward Choice Task

Sessions   | Initial black patches (Yellow, White, Yellow, White) | Patches/step (Yellow, White, Yellow, White)
1, 2, 7, 8 | 72, 24, 18, 9 | 8, 2, 6, 2
3          | 72, 24, 18, 9 | 8, 2, 14, 2
4          | 72, 24, 18, 9 | 16, 2, 14, 2
5, 6       | 72, 24, 18, 9 | 16, 2, 6, 2

yellow: large reward with long delay
white: small reward with short delay

Page 53:

Choice Behaviors

Shift of the indifference line was not consistent among the 12 subjects

Page 54:

Modulation of Striatal Response

[Striatal response maps at γ = 0.6, 0.7, 0.8, 0.9, 0.99 for Tr−, Tr0, Tr+]

Page 55:

Modulation by Tr Levels

[Image placeholders: three TIFF (LZW) images]

Page 56:

Changes in Correlation Coefficient

γ = 0.6 (28, 0, −4)    γ = 0.99 (16, 2, 28)

Tr− < Tr+: correlation with V at large γ in dorsal putamen
Tr− > Tr+: correlation with V at small γ in ventral putamen

[Plots: regression slope for each tryptophan level]

ROI (region of interest) analysis

Page 57:

Summary

Immediate reward: lateral OFC

Future reward: parietal cortex, PMd, DLPFC, lateral cerebellum, dorsal raphe

Ventro-dorsal gradient: insula, striatum

Serotonergic modulation

Page 58:

Outline

Introduction

Cerebellum, basal ganglia, and cortex

Meta-learning and neuromodulators

Prediction time scale and serotonin

Page 59:

Collaborators

Kyoto PU: Minoru Kimura, Yasumasa Ueda

Hiroshima U: Shigeto Yamawaki, Yasumasa Okamoto, Go Okada, Kazutaka Ueda, Shuji Asahi, Kazuhiro Shishida

ATR: Jun Morimoto, Kazuyuki Samejima

CREST: Nicolas Schweighofer, Genci Capi

NAIST: Saori Tanaka

OIST: Eiji Uchibe, Stefan Elfwing

