

A Neural Substrate of Prediction and Reward

Wolfram Schultz, Peter Dayan, P. Read Montague*

The capacity to predict future events permits a creature to detect, model, and manipulate the causal structure of its interactions with its environment. Behavioral experiments suggest that learning is driven by changes in the expectations about future salient events such as rewards and punishments. Physiological work has recently complemented these studies by identifying dopaminergic neurons in the primate whose fluctuating output apparently signals changes or errors in the predictions of future salient and rewarding events. Taken together, these findings can be understood through quantitative theories of adaptive optimizing control.

An adaptive organism must be able to predict future events such as the presence of mates, food, and danger. For any creature, the features of its niche strongly constrain the time scales for prediction that are likely to be useful for its survival. Predictions give an animal time to prepare behavioral reactions and can be used to improve the choices an animal makes in the future. This anticipatory capacity is crucial for deciding between alternative courses of action because some choices may lead to food whereas others may result in injury or loss of resources.

Experiments show that animals can predict many different aspects of their environments, including complex properties such as the spatial locations and physical characteristics of stimuli (1). One simple, yet useful prediction that animals make is the probable time and magnitude of future rewarding events. “Reward” is an operational concept for describing the positive value that a creature ascribes to an object, a behavioral act, or an internal physical state. The function of reward can be described according to the behavior elicited (2). For example, appetitive or rewarding stimuli induce approach behavior that permits an animal to consume. Rewards may also play the role of positive reinforcers where they increase the frequency of behavioral reactions during learning and maintain well-established appetitive behaviors after learning. The reward value associated with a stimulus is not a static, intrinsic property of the stimulus. Animals can assign different appetitive values to a stimulus as a function of their internal states at the time the stimulus is encountered and as a function of their experience with the stimulus.

One clear connection between reward and prediction derives from a wide variety of conditioning experiments (1). In these experiments, arbitrary stimuli with no intrinsic reward value will function as rewarding stimuli after being repeatedly associated in time with rewarding objects—these objects are one form of unconditioned stimulus (US). After such associations develop, the neutral stimuli are called conditioned stimuli (CS). In the descriptions that follow, we call the appetitive CS the sensory cue and the US the reward. It should be kept in mind, however, that learning that depends on CS-US pairing takes many different forms and is not always dependent on reward (for example, learning associated with aversive stimuli). In standard conditioning paradigms, the sensory cue must consistently precede the reward in order for an association to develop. After conditioning, the animal’s behavior indicates that the sensory cue induces a prediction about the likely time and magnitude of the reward and tends to elicit approach behavior. It appears that this form of learning is associated with a transfer of an appetitive or approach-eliciting component of the reward back to the sensory cue.

Some theories of reward-dependent learning suggest that learning is driven by the unpredictability of the reward by the sensory cue (3, 4). One of the main ideas is that no further learning takes place when the reward is entirely predicted by a sensory cue (or cues). For example, if presentation of a light is consistently followed by food, a rat will learn that the light predicts the future arrival of food. If, after such training, the light is paired with a sound and this pair is consistently followed by food, then something unusual happens—the rat’s behavior indicates that the light continues to predict food, but the sound predicts nothing. This phenomenon is called “blocking.” The prediction-based explanation is that the light fully predicts the food that arrives and the presence of the sound adds no new predictive (useful) information; therefore, no association developed to the sound (5). It appears therefore that learning is driven by deviations or “errors” between the predicted time and amount of rewards and their actual experienced times and magnitudes [but see (4)].

Engineered systems that are designed to optimize their actions in complex environments face the same challenges as animals, except that the equivalents of rewards and punishments are determined by design goals. One established method by which artificial systems can learn to predict is called the temporal difference (TD) algorithm (6). This algorithm was originally inspired by behavioral data on how animals actually learn predictions (7). Real-world applications of TD models abound. The predictions learned by TD methods can also be used to implement a technique called dynamic programming, which specifies how a system can come to choose appropriate actions. In this article, we review how these computational methods provide an interpretation of the activity of dopamine neurons thought to mediate reward-processing and reward-dependent learning. The connection between the computational theory and the experimental results is striking and provides a quantitative framework for future experiments and theories on the computational roles of ascending monoaminergic systems (8–13).

W. Schultz is at the Institute of Physiology, University of Fribourg, CH-1700 Fribourg, Switzerland. E-mail: [email protected]. P. Dayan is in the Department of Brain and Cognitive Sciences, Center for Biological and Computational Learning, E-25 MIT, Cambridge, MA 02139, USA. E-mail: [email protected]. P. R. Montague is in the Division of Neuroscience, Center for Theoretical Neuroscience, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, USA. E-mail: [email protected]

*To whom correspondence should be addressed.


Information Encoded in Dopaminergic Activity

Dopamine neurons of the ventral tegmental area (VTA) and substantia nigra have long been identified with the processing of rewarding stimuli. These neurons send their axons to brain structures involved in motivation and goal-directed behavior, for example, the striatum, nucleus accumbens, and frontal cortex. Multiple lines of evidence support the idea that these neurons construct and distribute information about rewarding events.

First, drugs like amphetamine and cocaine exert their addictive actions in part by prolonging the influence of dopamine on target neurons (14). Second, neural pathways associated with dopamine neurons are among the best targets for electrical self-stimulation. In these experiments, rats press bars to excite neurons at the site of an implanted electrode (15). The rats often choose these apparently rewarding stimuli over food and sex. Third, animals treated with dopamine receptor blockers learn less rapidly to press a bar for a reward pellet (16). All the above results generally implicate midbrain dopaminergic activity in reward-dependent learning. More precise information about the role played by midbrain dopaminergic activity derives from experiments in which activity of single dopamine neurons is recorded in alert monkeys while they perform behavioral acts and receive rewards.

In these latter experiments (17), dopamine neurons respond with short, phasic activations when monkeys are presented with various appetitive stimuli. For example, dopamine neurons are activated when animals touch a small morsel of apple or receive a small quantity of fruit juice to the mouth as liquid reward (Fig. 1). These phasic activations do not, however, discriminate between these different types of rewarding stimuli. Aversive stimuli like air puffs to the hand or drops of saline to the mouth do not cause these same transient activations. Dopamine neurons are also activated by novel stimuli that elicit orienting reactions; however, for most stimuli, this activation lasts for only a few presentations. The responses of these neurons are relatively homogeneous—different neurons respond in the same manner and different appetitive stimuli elicit similar neuronal responses. All responses occur in the majority of dopamine neurons (55 to 80%).

Surprisingly, after repeated pairings of visual and auditory cues followed by reward, dopamine neurons change the time of their phasic activation from just after the time of reward delivery to the time of cue onset. In one task, a naïve monkey is required to touch a lever after the appearance of a small light. Before training and in the initial phases of training, most dopamine neurons show a short burst of impulses after reward delivery (Fig. 1, top). After several days of training, the animal learns to reach for the lever as soon as the light is illuminated, and this behavioral change correlates with two remarkable changes in the dopamine neuron output: (i) the primary reward no longer elicits a phasic response; and (ii) the onset of the (predictive) light now causes a phasic activation in dopamine cell output (Fig. 1, middle). The changes in dopaminergic activity strongly resemble the transfer of an animal’s appetitive behavioral reaction from the US to the CS.

In trials where the reward is not delivered at the appropriate time after the onset of the light, dopamine neurons are depressed markedly below their basal firing rate exactly at the time that the reward should have occurred (Fig. 1, bottom). This well-timed decrease in spike output shows that the expected time of reward delivery based on the occurrence of the light is also encoded in the fluctuations in dopaminergic activity (18). In contrast, very few dopamine neurons respond to stimuli that predict aversive outcomes.

The language used in the foregoing description already incorporates the idea that dopaminergic activity encodes expectations about external stimuli or reward. This interpretation of these data provides a link to an established body of computational theory (6, 7). From this perspective, one sees that dopamine neurons do not simply report the occurrence of appetitive events. Rather, their outputs appear to code for a deviation or error between the actual reward received and predictions of the time and magnitude of reward. These neurons are activated only if the time of the reward is uncertain, that is, unpredicted by any preceding cues. Dopamine neurons are therefore excellent feature detectors of the “goodness” of environmental events relative to learned predictions about those events. They emit a positive signal (increased spike production) if an appetitive event is better than predicted, no signal (no change in spike production) if an appetitive event occurs as predicted, and a negative signal (decreased spike production) if an appetitive event is worse than predicted (Fig. 1).

Computational Theory and Model

The TD algorithm (6, 7) is particularly well suited to understanding the functional role played by the dopamine signal in terms of the information it constructs and broadcasts (8, 10, 12). This work has used fluctuations in dopamine activity in dual roles: (i) as a supervisory signal for synaptic weight changes (8, 10, 12) and (ii) as a signal to influence directly and indirectly the choice of behavioral actions in humans and bees (9–11). Temporal difference methods have been used in a wide spectrum of engineering applications that seek to solve prediction problems analogous to those faced by living creatures (19). Temporal difference methods were introduced into the psychological and biological literature by Richard Sutton and Andrew Barto in the early 1980s (6, 7). It is therefore interesting that this method yields some insight into the output of dopamine neurons in primates.

[Figure 1 shows three peri-event time histograms and rasters of dopamine neuron impulses, plotted from −1 to 2 s around the CS or reward, under the title “Do dopamine neurons report an error in the prediction of reward?”: no prediction, reward occurs (top); reward predicted, reward occurs (middle); reward predicted, no reward occurs (bottom).]

Fig. 1. Changes in dopamine neurons’ output code for an error in the prediction of appetitive events. (Top) Before learning, a drop of appetitive fruit juice occurs in the absence of prediction—hence a positive error in the prediction of reward. The dopamine neuron is activated by this unpredicted occurrence of juice. (Middle) After learning, the conditioned stimulus predicts reward, and the reward occurs according to the prediction—hence no error in the prediction of reward. The dopamine neuron is activated by the reward-predicting stimulus but fails to be activated by the predicted reward (right). (Bottom) After learning, the conditioned stimulus predicts a reward, but the reward fails to occur because of a mistake in the behavioral response of the monkey. The activity of the dopamine neuron is depressed exactly at the time when the reward would have occurred. The depression occurs more than 1 s after the conditioned stimulus without any intervening stimuli, revealing an internal representation of the time of the predicted reward. Neuronal activity is aligned on the electronic pulse that drives the solenoid valve delivering the reward liquid (top) or the onset of the conditioned visual stimulus (middle and bottom). Each panel shows the peri-event time histogram and raster of impulses from the same neuron. Horizontal distances of dots correspond to real-time intervals. Each line of dots shows one trial. Original sequence of trials is plotted from top to bottom. CS, conditioned, reward-predicting stimulus; R, primary reward.

There are two main assumptions in TD. First, the computational goal of learning is to use the sensory cues to predict a discounted sum of all future rewards V(t) within a learning trial:

V(t) = E[γ⁰r(t) + γ¹r(t + 1) + γ²r(t + 2) + ⋯]    (1)

where r(t) is the reward at time t and E[·] denotes the expected value of the sum of future rewards up to the end of the trial. The discount factor γ, with 0 ≤ γ ≤ 1, makes rewards that arrive sooner more important than rewards that arrive later. Predicting the sum of future rewards is an important generalization over static conditioning models like the Rescorla-Wagner rule for classical conditioning (1–4). The second main assumption is the Markovian one, that is, the presentation of future sensory cues and rewards depends only on the immediate (current) sensory cues and not the past sensory cues.

As explained below, the strategy is to use a vector describing the presence of sensory cues x(t) in the trial along with a vector of adaptable weights w to make an estimate V̂(t) of the true V(t). The reason that the sensory cue is written as a vector is explained below. The difficulty in adjusting weights w to estimate V(t) is that the system (that is, the animal) would have to wait to receive all its future rewards in a trial r(t + 1), r(t + 2), . . . to assess its predictions. This latter constraint would require the animal to remember over time which weights need changing and which weights do not.

Fortunately, there is information available at each instant in time that can act as a surrogate prediction error. This possibility is implicit in the definition of V(t) because it satisfies a condition of consistency through time:

V(t) = E[r(t) + γV(t + 1)]    (2)

An error in the estimated predictions can now be defined with information available at successive time steps:

δ(t) = r(t) + γV̂(t + 1) − V̂(t)    (3)

This δ(t) is called the TD error and acts as a surrogate prediction error signal that is instantly available at time t + 1. As described below, δ(t) is used to improve the estimates of V(t) and also to choose appropriate actions.
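As a minimal illustration of Eqs. 1 to 3 (our sketch, not code from the article), the following Python fragment computes δ(t) for one trial from an assumed reward sequence and a current set of estimates V̂; the names td_errors, rewards, values, and the discount value 0.98 are our own choices.

import numpy as np

def td_errors(rewards, values, gamma=0.98):
    """Compute delta(t) = r(t) + gamma * V_hat(t+1) - V_hat(t) for one trial.

    rewards: array of r(t) for t = 0..T-1
    values:  array of current estimates V_hat(t) for t = 0..T-1
    The estimate beyond the end of the trial is taken to be 0.
    """
    next_values = np.append(values[1:], 0.0)  # V_hat(t+1), zero after the trial ends
    return rewards + gamma * next_values - values

# Example: reward of 1 at the last step of a 5-step trial, all predictions still 0.
print(td_errors(np.array([0., 0., 0., 0., 1.]), np.zeros(5)))
# -> [0. 0. 0. 0. 1.]  a positive error only at the (unpredicted) reward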

Representing a stimulus through time. We suggested above that a set of sensory cues along with an associated set of adaptable weights would suffice to estimate V(t) (the discounted sum of future rewards). It is, however, not sufficient for the representation of each sensory cue (for example, a light) to have only one associated adaptable weight because such a model would not account for the data shown above—it would not be able to represent both the time of the cue and the time of reward delivery. These experimental data show that a sensory cue can predict reward delivery at arbitrary times into the near future. This conclusion holds for both the monkeys’ behavior and the output of the dopamine neurons. If the time of reward delivery is changed relative to the time of cue onset, then the same cue will come to predict the new time of reward delivery. The way in which such temporal labels are constructed in neural tissue is not known, but it is clear that they exist (20).

Given these facts, we assume that each sensory cue consists of a vector of signals x(t) = {x₁(t), x₂(t), ⋯} that represent the light for variable lengths of time into the future, that is, xᵢ(t) is 1 exactly i time steps after the presentation of the light in the trial and 0 otherwise (Fig. 2B). Each component of x(t), xᵢ(t), has its own prediction weight wᵢ (Fig. 2B). This representation means that if the light comes on at time s, x₁(s + 1) = 1, x₂(s + 2) = 1, . . . represent the light at 1, 2, . . . time steps into the future and w₁, w₂, . . . are the respective weights. The net prediction for cue x(t) at time t takes the simple linear form

V̂(t) ≡ V̂(x(t)) = Σᵢ wᵢxᵢ(t)    (4)

This form of temporal representation is what Sutton and Barto (7) call a complete serial-compound stimulus and is related to Grossberg’s spectral timing model (21). Unfortunately, virtually nothing is known about how the brain represents a stimulus for substantial periods of time into the future; therefore, all temporal representations are underconstrained from a biological perspective.

As in trial-based models like the Rescorla-Wagner rule, the adaptable weights w are improved according to the correlation between the stimulus representations and the prediction error. The change in weights from one trial to the next is

Δwᵢ = αₓ Σₜ xᵢ(t)δ(t)    (5)

where αₓ is the learning rate for cue x(t) and the sum over t is taken over the course of a trial. It has been shown that under certain conditions this update rule (Eq. 5) will cause V̂(t) to converge to the true V(t) (22). If there were many different sensory cues, each would have its own vector representation and its own vector of weights, and Eq. 4 would be summed over all the cues.
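The complete serial-compound representation and the update rule of Eq. 5 can be written compactly. The sketch below is our illustration, not the authors’ implementation: it represents a single cue as the delay-line vector of Fig. 2B and applies Δwᵢ = αₓ Σₜ xᵢ(t)δ(t) once per trial; the learning rate, trial length, and unit reward are arbitrary assumptions.

import numpy as np

def run_trial(w, cue_time, reward_time, T, alpha=0.3, gamma=1.0):
    """One trial of TD learning with a complete serial-compound stimulus.

    w[i] is the weight for 'cue occurred i steps ago'; returns the updated w
    and the per-timestep prediction errors delta(t).
    """
    x = np.zeros((T, len(w)))                    # x[t, i] = 1 iff t == cue_time + i
    for i in range(len(w)):
        if cue_time + i < T:
            x[cue_time + i, i] = 1.0
    r = np.zeros(T)
    r[reward_time] = 1.0                         # fixed unit reward on every trial
    V = x @ w                                    # Eq. 4: V_hat(t) = sum_i w_i x_i(t)
    delta = r + gamma * np.append(V[1:], 0.0) - V    # Eq. 3
    w = w + alpha * (x * delta[:, None]).sum(axis=0) # Eq. 5, summed over the trial
    return w, delta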

Comparing model and data. We now turn this apparatus toward the neural and behavioral data described above. To construct and use an error signal similar to the TD error above, a neural system would need to possess four basic features: (i) access to a measure of reward value r(t); (ii) a signal measuring the temporal derivative of the ongoing prediction of reward, γV̂(t + 1) − V̂(t); (iii) a site where these signals could be summed; and (iv) delivery of the error signal to areas constructing the prediction in such a way that it can control plasticity.

Fig. 2. Constructing and using a prediction error. (A) Interpretation of the anatomical arrangement of inputs and outputs of the ventral tegmental area (VTA). M1 and M2 represent two different cortical modalities whose output is assumed to arrive at the VTA in the form of a temporal derivative (surprise signal) V̇(t), which reflects the degree to which the current sensory state differs from the previous sensory state. The high degree of convergence forces V̇(t) to arrive at the VTA as a scalar signal. Information about reward r(t) also converges on the VTA. The VTA output is taken as a simple linear sum δ(t) = r(t) + V̇(t). The widespread output connections of the VTA make the prediction error δ(t) simultaneously available to structures constructing the predictions. (B) Temporal representation of a sensory cue. A cue like a light is represented at multiple delays xₙ from its initial time of onset, and each delay is associated with a separate adjustable weight wₙ. These parameters wₙ are adjusted according to the correlation of activity xₙ and δ and through training come to act as predictions. This simple system stores predictions rather than correlations.

It has been previously proposed that midbrain dopamine neurons satisfy features (i), (ii), and (iii) listed above (Fig. 2A) (8, 10, 12). As indicated in Fig. 2, the dopamine neurons receive highly convergent input from many brain regions. The model represents the hypothesis that this input arrives in the form of a surprise signal that measures the degree to which the current sensory state differs from the last sensory state. We assume that the dopamine neurons’ output actually reflects δ(t) + b(t), where b(t) is a basal firing rate (12). Figure 3 shows the training of the model on a task where a single sensory cue predicted the future delivery of a fixed amount of reward 20 time steps into the future. The prediction error signal (top) matches the activity of the real dopamine neurons over the course of learning. The pattern of weights that develops (bottom) provides the model’s explanations for two well-described behavioral effects—blocking and secondary conditioning (1). The model accounts for the behavior of the dopamine neurons in a variety of other experiments in monkeys (12). The model also accounts for changes in dopaminergic activity if the time of the reward is changed (18).
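Repeatedly running the trial sketched above (our illustration, reusing the run_trial function and numpy import from the previous sketch; the cue and reward times are our own choices, not those of Fig. 3) shifts the positive prediction error from the time of reward delivery to the arrival of the predictive cue, qualitatively reproducing the surfaces in Fig. 3.

w = np.zeros(60)                       # one weight per delay after cue onset
for trial in range(300):
    w, delta = run_trial(w, cue_time=10, reward_time=50, T=60)

# After training the error at the reward time is near zero, while a positive
# error remains where the cue's arrival is first registered (Eq. 3 indexes it
# at t = cue_time - 1 because delta(t) looks ahead to V_hat(t + 1)).
print(round(delta[50], 2), round(delta[9], 2))   # -> approximately 0.0 and 1.0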

The model makes two other testable predictions: (i) in the presence of multiple sensory cues that predict reward, the phasic activation of the neurons will transfer to the earliest consistent cue. (ii) After training on multiple sensory cues, omission of an intermediate cue will be accompanied by a phasic decrease in dopaminergic activity at the time that the cue formerly occurred. For example, after training a monkey on the temporal sequence light 1 → light 2 → reward, the dopamine neurons should respond phasically only to the onset of light 1. At this point, if light 2 is omitted on a trial, the activity in the neurons will depress at the time that light 2 would have occurred.

Choosing and criticizing actions. We showed above how the dopamine signal can be used to learn and store predictions; however, these same responses could also be used to influence the choice of appropriate actions through a connection with a technique called dynamic programming (23). We discuss below the connection to dynamic programming.

We introduce this use with a simple example. Suppose a rat must move through a maze to gain food. In the hallways of the maze, the rat has two options available to it: go forward a step or go backward a step. At junctions, the rat has three or four directions from which to choose. At each position, the rat has various actions available to it, and the action chosen will affect its future prospects for finding its way to food. A wrong turn at one point may not be felt as a mistake until many steps later when the rat runs into a dead end. How is the rat to know which action was crucial in leading it to the dead end? This is called the temporal credit assignment problem: Actions at one point in time can affect the acquisition of rewards in the future in complicated ways.

One solution to temporal credit assignment is to describe the animal as adopting and improving a “policy” that specifies how its actions are assigned to its states. Its state is the collection of sensory cues associated with each maze position. To improve a policy, the animal requires a means to evaluate the value of each maze position. The evaluation used in dynamic programming is the amount of summed future reward expected from each maze position provided that the animal follows its policy. The summed future reward expected from some state [that is, V(t)] is exactly what the TD method learns, suggesting a connection with the dopamine signal.
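A minimal sketch of this dynamic-programming idea (ours; the corridor, rewards, and policy are invented for illustration): the value of a state under a fixed policy is the expected sum of discounted future rewards obtained by following that policy, which is the same quantity V that TD learning estimates.

def evaluate_policy(states, policy, reward, next_state, gamma=0.95, sweeps=200):
    """Iterative policy evaluation: V(s) <- r(s, policy(s)) + gamma * V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            a = policy[s]
            V[s] = reward(s, a) + gamma * V[next_state(s, a)]
    return V

# Tiny corridor: positions 0-1-2, food reached by stepping forward from 1.
states = [0, 1, 2]
policy = {0: "forward", 1: "forward", 2: "stay"}
reward = lambda s, a: 1.0 if (s == 1 and a == "forward") else 0.0
next_state = lambda s, a: min(s + 1, 2) if a == "forward" else s
print(evaluate_policy(states, policy, reward, next_state))
# -> {0: 0.95, 1: 1.0, 2: 0.0}  positions closer to food are worth more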

As the rat above explores the maze, its predictions become more accurate. The predictions are considered “correct” once the average prediction error δ(t) is 0. At this point, fluctuations in dopaminergic activity represent an important “economic evaluation” that is broadcast to target structures: Greater than baseline dopamine activity means the action performed is “better than expected” and less than baseline means “worse than expected.” Hence, dopamine responses provide the information to implement a simple behavioral strategy—take [or learn to take (24)] actions correlated with increased dopamine activity and avoid actions correlated with decreases in dopamine activity.

A very simple such use of δ(t) as an evaluation signal for action choice is a form of learned klinokinesis (25), choosing one action while δ(t) > 0, and choosing a new random action if δ(t) ≤ 0. This use of δ(t) has been shown to account for bee foraging behavior on flowers that yield variable returns (9, 11). Figure 4 shows the way in which TD methods can construct for a mobile “creature” a useful map of the value of certain actions.
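A hedged sketch of this klinokinetic rule (our illustration; the function and argument names are hypothetical, and whatever process supplies δ(t) is left outside the sketch): keep the current action while the prediction error is positive, and switch to a new random action otherwise.

import random

def klinokinesis_step(current_action, delta, actions):
    """Keep the current action while delta > 0; otherwise pick a new random action."""
    if delta > 0:
        return current_action
    return random.choice([a for a in actions if a != current_action])

# Example: a creature heading "north" keeps going while things improve,
# and turns somewhere else as soon as delta(t) <= 0.
print(klinokinesis_step("north", 0.4, ["north", "south", "east", "west"]))   # -> north
print(klinokinesis_step("north", -0.1, ["north", "south", "east", "west"]))  # -> a different direction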

[Figure 3 shows two surfaces plotted over time step within a trial (0 to 60) and trial number (0 to 40): (top) the prediction error and (bottom) the value function V, each ranging from −1 to 1.]

Fig. 3. Development of prediction error signal through training. (Top) Prediction error (changes in dopamine neuron output) as a function of time and trial. On each trial, a sensory cue is presented at time step 10 and time step 20 followed by reward delivery [r(t) = 1] at time step 60. On trial 0, the presentation of the two cues causes no change because the associated weights are initially set to 0. There is, however, a strong positive response (increased firing rate) at the delivery of reward at time step 60. By repeating the pairing of the sensory cues followed in time by reward, the transient response of the model shifts to the time of the earliest sensory cue (time step 10). Failure to deliver the reward during an intermediate trial causes a large negative fluctuation in the model’s output. This would be seen in an experiment as a marked decrease in spike output at the time that reward should have been delivered. In this example, the timing of reward delivery is learned well before any response transfers to the earliest sensory cue. (Bottom) The value function V(t). The weights are all initially set to 0 (trial 0). After the large prediction error occurs on trial 0, the weights begin to grow. Eventually they all saturate to 1 so that the only transient is the unpredicted onset of the first sensory cue. The depression in the surface results from the error trial where the reward was not delivered at the expected time.

A TD model was equipped with a simple visual system (two, 200 by 200 pixel retinae) and trained on three different sensory cues (colored blocks) that differed in the amount of reward each contained (blue > green > red). The model had three neurons, each sensitive only to the percentage of one color in the visual field. Each color-sensitive neuron provides input to the prediction unit P (analog of VTA unit in Fig. 2) through a single weight. Dedicating only a single weight to each cue limits this “creature” to a one time step prediction on the basis of its current state. After experiencing each type of object multiple times, the weights reflect the relative amounts of reward in each object, that is, wb > wg > wr. These three weights equip the creature with a kind of cognitive map or “value surface” with which to assay its possible actions (Fig. 4B).

The value surface above the arena is a plot of the value function V(x, y) (height) when the creature is placed in the indicated corner and looks at every position (x, y) in the arena. The value V(x, y) of looking at each position (x, y) is computed as a linear function of the weights (wb, wg, wr) associated with activity induced in the color-sensitive units. As this “creature” changes its direction of gaze from one position (x0, y0) at time t to another position (x1, y1) at time t + 1, the difference in the values of these two positions, V(t + 1) − V(t), is available as the output δ(t) of the prediction neuron P. In this example, when the creature looks from point 1 to point 2, the percentage of blue in its visual field increases. This increase is available as a positive fluctuation (“things are better than expected”) in the output δ(t) of neuron P. Similarly, looking from point 2 to point 1 causes a large negative fluctuation in δ(t) (“things are worse than expected”). As discussed above, these fluctuations could be used by some target structure to decide whether to move in the direction of sight. Directions associated with a positive prediction error are likely to yield increased future returns.
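As an illustration of this value computation (ours, with made-up weights and color fractions rather than the article’s trained model), V for one direction of gaze is a weighted sum of the color fractions in the visual field, and δ is simply the change in V between successive fixations.

# Assumed weights reflecting reward content (blue > green > red) and assumed
# color fractions of the visual field for two directions of gaze.
w = {"blue": 1.0, "green": 0.5, "red": 0.1}

def value(color_fractions):
    """V for one direction of gaze: linear in the fraction of each color seen."""
    return sum(w[c] * f for c, f in color_fractions.items())

look_at_1 = {"blue": 0.05, "green": 0.30, "red": 0.10}   # little blue in view
look_at_2 = {"blue": 0.40, "green": 0.10, "red": 0.05}   # mostly blue in view

# Shifting gaze from point 1 to point 2 raises V, so delta = V(t+1) - V(t) > 0
# ("better than expected"); the reverse shift gives a negative delta.
delta = value(look_at_2) - value(look_at_1)
print(round(delta, 3))   # -> approximately 0.245 (positive)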

This example illustrates how only three stored quantities (weights associated with each color) and the capacity to look at different locations endow this simple “creature” with a useful map of the quality of different directions in the arena. This same model has been given simple card-choice tasks analogous to those given to humans (26), and the model matches the human behavior well. It is also interesting that humans develop a predictive galvanic skin response that predicts appropriately which card decks are good and which are bad (26).

Summary and Future Questions

We have reviewed evidence that supports the proposal that dopamine neurons in the VTA and the substantia nigra report ongoing prediction errors for reward. The output of these neurons is consistent with a scalar prediction error signal; therefore, the delivery of this signal to target structures may influence the processing of predictions and the choice of reward-maximizing actions. These conclusions are supported by data on the activity changes of these neurons during the acquisition and expression of a range of simple conditioning tasks. This representation of the experimental data raises a number of important issues for future work.

The first issue concerns temporal representations, that is, how is any stimulus represented through time? A large body of behavioral data shows that animals can keep track of the time elapsed from the presentation of a CS and make precise predictions accordingly. We adopted a very simple model of this capacity, but experiments have yet to suggest where or how the temporal information is constructed and used by the brain. It is not yet clear how far into the future such predictions can be made; however, one suspects that they will be longer than the predictions made by structures that mediate cerebellar eyeblink conditioning and motor learning displayed by the vestibulo-ocular reflex (27). The time scales that are ethologically important to a particular creature should provide good constraints when searching for mechanisms that might construct and distribute temporal labels in the cerebral cortex.

A second issue is information about aversive events. The experimental data suggest that the dopamine system provides information about appetitive stimuli, not aversive stimuli. It is possible, however, that the absence of an expected reward is interpreted as a kind of “punishment” to some other system to which the dopamine neurons send their output. It would then be the responsibility of these targets to pass out information about the degree to which the nondelivery of reward was “punishing.” It was long ago proposed that rewards and punishments represent opponent processes and that the dynamics of opponency might be responsible for many puzzling effects in conditioning (28).

A third issue raised by the model is the relation between scalar signals of appetitive values and vector signals with many components, including those that represent primary rewards and predictive stimuli. Simple models like the one presented above may be able to learn with a scalar signal only if the scope of choices is limited. Behavior in more realistic environmental situations requires vector signaling of the type of rewards and of the various physical components of the predictive stimuli. Without the capacity to discriminate which stimuli are responsible for fluctuations in a broadcast scalar error signal, an agent may learn inappropriately; for example, it may learn to approach food when it is actually thirsty.

[Figure 4 shows (A) the architecture of the TD model, with color-sensitive units b, g, and r feeding the prediction unit P, which also receives reward input r(t) and outputs δ(t) for biased action selection, and (B) the value surface over an arena containing blue (high reward), green (medium reward), and red (low reward) blocks, viewed from the corner where the creature is positioned, with two gaze points labeled 1 and 2.]

Fig. 4. Simple cognitive maps can be easily built and used. (A) Architecture of the TD model. Three color-sensitive units (b, g, r) report, respectively, the percentage of blue, green, and red in the visual field. Each unit influences neuron P (VTA analog) through a single weight. The colored blocks contain varying amounts of reward with blue > green > red. After training, the weights (wb, wg, wr) reflect this difference in reward content. Using only a single weight for each sensory cue, the model can make only one time step predictions; however, combined with its capacity to move its head or walk about the arena, a crude “value-map” is available in the output δ(t) of neuron P. (B) Value surface for the arena when the creature is positioned in the corner as indicated. The height of the surface codes for the value V(x, y) of each location when viewed from the corner where the “creature” is positioned. All the creature needs to do is look from one location to another (or move from one position to another), and the differences in value V(t + 1) − V(t) are coded in the changes in the firing rate of P (see text).

Dopamine neurons emit an excellent appetitive error (teaching) signal without indicating further details about the appetitive event. It is therefore likely that other reward-processing structures subserve the analysis and discrimination of appetitive events without constituting particularly efficient teaching signals. This putative division of labor between the analysis of physical and functional attributes and scalar evaluation signals raises a fourth issue—attention.

The model does not address the attentional functions of some of the innervated structures, such as the nucleus accumbens and the frontal cortex. Evidence suggests that these structures are important for cases in which different amounts of attention are paid to different stimuli. There is, however, evidence to suggest that the required attentional mechanisms might also operate at the level of the dopamine neurons. Their responses to novel stimuli will decrement with repeated presentation and they will generalize their responses to nonappetitive stimuli that are physically similar to appetitive stimuli (29). In general, questions about attentional effects in dopaminergic systems are ripe for future work.

The suggestion that a scalar prediction-error signal influences behavioral choices receives support from the preliminary work on human decision-making and from the fact that changes in dopamine activity fluctuations parallel changes in the behavioral performance of the monkeys (30). In the mammalian brain, the striatum is one site where this kind of scalar evaluation could have a direct effect on action choice, and activity relating to conditioned stimuli is seen in the striatum (31). The widespread projection of dopamine axons to striatal neurons gives rise to synapses at dendritic spines that are also contacted by excitatory inputs from cortex (32). This may be a site where the dopamine signal influences behavioral choices by modulating the level of competition in the dorsal striatum. Phasic dopamine signals may lead to an augmentation of excitatory influences in the striatum (33), and there is evidence for striatal plasticity after pulsatile application of dopamine (34).

The possibilities in the striatum for using a scalar evaluation signal carried by changes in dopamine delivery are complemented by interesting possibilities in the cerebral cortex. In prefrontal cortex, dopamine delivery has a dramatic influence on working memory (35). Dopamine also modulates cognitive activation of anterior cingulate cortex in schizophrenic patients (36). Clearly, dopamine delivery has important cognitive consequences at the level of the cerebral cortex. Under the model presented here, changes in dopaminergic activity distribute prediction errors to widespread target structures. It seems reasonable to require that the prediction errors be delivered primarily to those regions most responsible for making the predictions; otherwise, one cortical region would have to deal with prediction errors engendered by the bad guesses of another region. From this point of view, one could expect there to be a mechanism that coupled local activity in the cortex to an enhanced sensitivity of nearby dopamine terminals to differences from baseline in spike production along their parent axon. There is experimental evidence that supports this possibility (37).

Neuromodulatory systems like dopamine systems are so named because they were thought to modulate global states of the brain at time scales and temporal resolutions much poorer than other systems like fast glutamatergic connections. Although this global modulation function may be accurate, the work discussed here shows that neuromodulatory systems may also deliver precisely timed information to specific target structures to influence a number of important cognitive functions.

REFERENCES AND NOTES

1. A. Dickinson, Contemporary Animal Learning Theory (Cambridge Univ. Press, Cambridge, 1980); N. J. Mackintosh, Conditioning and Associative Learning (Oxford Univ. Press, Oxford, 1983); C. R. Gallistel, The Organization of Learning (MIT Press, Cambridge, MA, 1990); L. A. Real, Science 253, 980 (1991).

2. I. P. Pavlov, Conditioned Reflexes (Oxford Univ. Press, Oxford, 1927); B. F. Skinner, The Behavior of Organisms (Appleton-Century-Crofts, New York, 1938); J. Olds, Drives and Reinforcement (Raven, New York, 1977); R. A. Wise, in The Neuropharmacological Basis of Reward, J. M. Liebeman and S. J. Cooper, Eds. (Clarendon Press, New York, 1989); N. W. White and P. M. Milner, Annu. Rev. Psychol. 43, 443 (1992); T. W. Robbins and B. J. Everitt, Curr. Opin. Neurobiol. 6, 228 (1996).

3. R. A. Rescorla and A. R. Wagner, in Classical Conditioning II: Current Research and Theory, A. H. Black and W. F. Prokasy, Eds. (Appleton-Century-Crofts, New York, 1972), pp. 64–69.

4. N. J. Mackintosh, Psychol. Rev. 82, 276 (1975); J. M. Pearce and G. Hall, ibid. 87, 532 (1980).

5. L. J. Kamin, in Punishment and Aversive Behavior, B. A. Campbell and R. M. Church, Eds. (Appleton-Century-Crofts, New York, 1969), pp. 279–296.

6. R. S. Sutton and A. G. Barto, Psychol. Rev. 88 (no. 2), 135 (1981); R. S. Sutton, Mach. Learn. 3, 9 (1988).

7. R. S. Sutton and A. G. Barto, Proceedings of the Ninth Annual Conference of the Cognitive Science Society (Seattle, WA, 1987); in Learning and Computational Neuroscience, M. Gabriel and J. Moore, Eds. (MIT Press, Cambridge, MA, 1989). For specific application to eyeblink conditioning, see J. W. Moore et al., Behav. Brain Res. 12, 143 (1986).

8. S. R. Quartz, P. Dayan, P. R. Montague, T. J. Sejnowski, Soc. Neurosci. Abstr. 18, 1210 (1992); P. R. Montague, P. Dayan, S. J. Nowlan, A. Pouget, T. J. Sejnowski, in Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, C. L. Giles, Eds. (Morgan Kaufmann, San Mateo, CA, 1993), pp. 969–976.

9. P. R. Montague, P. Dayan, T. J. Sejnowski, in Advances in Neural Information Processing Systems 6, G. Tesauro, J. D. Cowan, J. Alspector, Eds. (Morgan Kaufmann, San Mateo, CA, 1994), pp. 598–605.

10. P. R. Montague and T. J. Sejnowski, Learn. Mem. 1, 1 (1994); P. R. Montague, in Neural-Network Approaches to Cognition—Biobehavioral Foundations, J. Donahoe, Ed. (Elsevier, Amsterdam, in press); P. R. Montague and P. Dayan, in A Companion to Cognitive Science, W. Bechtel and G. Graham, Eds. (Blackwell, Oxford, in press).

11. P. R. Montague, P. Dayan, C. Person, T. J. Sejnowski, Nature 377, 725 (1995).

12. P. R. Montague, P. Dayan, T. J. Sejnowski, J. Neurosci. 16, 1936 (1996).

13. Other work has suggested an interpretation of monoaminergic influences similar to that taken above (8–12) [K. J. Friston, G. Tononi, G. N. Reeke, O. Sporns, G. M. Edelman, Neuroscience 59, 229 (1994); J. C. Houk, J. L. Adams, A. G. Barto, in Models of Information Processing in the Basal Ganglia, J. C. Houk, J. L. Davis, D. G. Beiser, Eds. (MIT Press, Cambridge, MA, 1995), pp. 249–270]. Other models of monoaminergic influences have considered what could be called attention-based accounts (4) rather than prediction error–based explanations [D. Servan-Schreiber, H. Printz, J. D. Cohen, Science 249, 892 (1990)].

14. G. F. Koob, Semin. Neurosci. 4, 139 (1992); R. A. Wise and D. C. Hoffman, Synapse 10, 247 (1992); G. DiChiara, Drug Alcohol Depend. 38, 95 (1995).

15. A. G. Phillips, S. M. Brooke, H. C. Fibiger, Brain Res. 85, 13 (1975); A. G. Phillips, D. A. Carter, H. C. Fibiger, ibid. 104, 221 (1976); F. Mora and R. D. Myers, Science 197, 1387 (1977); A. G. Phillips, F. Mora, E. T. Rolls, Psychopharmacology 62, 79 (1979); D. Corbett and R. A. Wise, Brain Res. 185, 1 (1980); R. A. Wise and P.-P. Rompre, Annu. Rev. Psychol. 40, 191 (1989).

16. R. A. Wise, Behav. Brain Sci. 5, 39 (1982); R. J. Beninger, Brain Res. Rev. 6, 173 (1983); R. J. Beninger and B. L. Hahn, Science 220, 1304 (1983); R. J. Beninger, Brain Res. Bull. 23, 365 (1989); M. LeMoal and H. Simon, Physiol. Rev. 71, 155 (1991); T. W. Robbins and B. J. Everitt, Semin. Neurosci. 4, 119 (1992).

17. W. Schultz, J. Neurophysiol. 56, 1439 (1986); R. Romo and W. Schultz, ibid. 63, 592 (1990); W. Schultz and R. Romo, ibid., p. 607; T. Ljungberg, P. Apicella, W. Schultz, ibid. 67, 145 (1992); W. Schultz, P. Apicella, T. Ljungberg, J. Neurosci. 13, 900 (1993); J. Mirenowicz and W. Schultz, J. Neurophysiol. 72, 1024 (1994); W. Schultz et al., in Models of Information Processing in the Basal Ganglia, J. C. Houk, J. L. Davis, D. G. Beiser, Eds. (MIT Press, Cambridge, MA, 1995), pp. 233–248; J. Mirenowicz and W. Schultz, Nature 379, 449 (1996).

18. Recent experiments showed that the simple displacement of the time of reward delivery resulted in dopamine responses. In a situation in which neurons were not driven by a fully predicted drop of juice, activations reappeared when the juice reward occurred 0.5 s earlier or later than predicted. Depressions were observed at the normal time of juice reward only if reward delivery was late [J. R. Hollerman and W. Schultz, Soc. Neurosci. Abstr. 22, 1388 (1996)].

19. G. Tesauro, Commun. ACM 38, 58 (1995); D. P. Bertsekas and J. N. Tsitsiklis, Neurodynamic Programming (Athena Scientific, Belmont, NJ, 1996).

20. R. M. Church, in Contemporary Learning Theories: Instrumental Conditioning Theory and the Impact of Biological Constraints on Learning, S. B. Klein and R. R. Mowrer, Eds. (Erlbaum, Hillsdale, NJ, 1989), p. 41; J. Gibbon, Learn. Motiv. 22, 3 (1991).

21. S. Grossberg and N. A. Schmajuk, Neural Networks 2, 79 (1989); S. Grossberg and J. W. L. Merrill, Cognit. Brain Res. 1, 3 (1992).

22. P. Dayan, Mach. Learn. 8, 341 (1992); P. Dayan and T. J. Sejnowski, ibid. 14, 295 (1994); T. Jaakkola, M. I. Jordan, S. P. Singh, Neural Computation 6, 1185 (1994).

23. R. E. Bellman, Dynamic Programming (Princeton Univ. Press, Princeton, NJ, 1957); R. A. Howard, Dynamic Programming and Markov Processes (MIT Press, Cambridge, MA, 1960).

24. A. G. Barto, R. S. Sutton, C. W. Anderson, IEEE Trans. Syst. Man Cybernetics 13, 834 (1983).

25. Bacterial klinokinesis has been described in great detail. Early work emphasized the mechanisms required for bacteria to climb gradients of nutrients. See R. M. Macnab and D. E. Koshland, Proc. Natl. Acad. Sci. U.S.A. 69, 2509 (1972); N. Tsang, R. Macnab, D. E. Koshland Jr., Science 181, 60 (1973); H. C. Berg and R. A. Anderson, Nature 245, 380 (1973); H. C. Berg, ibid. 254, 389 (1975); J. L. Spudich and D. E. Koshland, Proc. Natl. Acad. Sci. U.S.A. 72, 710 (1975). The klinokinetic action-selection mechanism causes a TD model to climb hills defined by the sensory weights, that is, the model will climb the surface defined by the value function V.

26. A. R. Damasio, Descartes’ Error (Putnam, New York, 1994); A. Bechara, A. R. Damasio, H. Damasio, S. Anderson, Cognition 50, 7 (1994).

27. S. P. Perrett, B. P. Ruiz, M. D. Mauk, J. Neurosci. 13, 1708 (1993); J. L. Raymond, S. G. Lisberger, M. D. Mauk, Science 272, 1126 (1996).

28. S. Grossberg, Math. Biosci. 15, 253 (1972); R. L. Solomon and J. D. Corbit, Psychol. Rev. 81, 119 (1974); S. Grossberg, ibid. 89, 529 (1982).

29. W. Schultz and R. Romo, J. Neurophysiol. 63, 607 (1990); T. Ljungberg, P. Apicella, W. Schultz, ibid. 67, 145 (1992); J. Mirenowicz and W. Schultz, Nature 379, 449 (1996).

30. W. Schultz, P. Apicella, T. Ljungberg, J. Neurosci. 13, 900 (1993).

31. T. Aosaki et al., ibid. 14, 3969 (1994); A. M. Graybiel, Curr. Opin. Neurobiol. 5, 733 (1995); Trends Neurosci. 18, 60 (1995). Recent models of sequence generation in the striatum use fluctuating dopamine input as a scalar error signal [G. S. Berns and T. J. Sejnowski, in Neurobiology of Decision Making, A. Damasio, Ed. (Springer-Verlag, Berlin, 1996), pp. 101–113].

32. T. F. Freund, J. F. Powell, A. D. Smith, Neuroscience 13, 1189 (1984); Y. Smith, B. D. Bennett, J. P. Bolam, A. Parent, A. F. Sadikot, J. Comp. Neurol. 344, 1 (1994).

33. C. Cepeda, N. A. Buchwald, M. S. Levine, Proc. Natl. Acad. Sci. U.S.A. 90, 9576 (1993).

34. J. R. Wickens, A. J. Begg, G. W. Arbuthnott, Neuroscience 70, 1 (1996).

35. P. S. Goldman-Rakic, C. Leranth, M. S. Williams, N. Mons, M. Geffard, Proc. Natl. Acad. Sci. U.S.A. 86, 9015 (1989); T. Sawaguchi and P. S. Goldman-Rakic, Science 251, 947 (1991); G. V. Williams and P. S. Goldman-Rakic, Nature 376, 572 (1995).

36. R. J. Dolan et al., Nature 378, 180 (1995).
37. P. R. Montague, C. D. Gancayco, M. J. Winn, R. B. Marchase, M. J. Friedlander, Science 263, 973 (1994). The mechanistic suggestion requires that local cortical activity (presumably glutamatergic) increases the sensitivity of nearby dopamine terminals to differences from baseline in spike production along their parent axon. This may result from local increases in nitric oxide production. In this manner, baseline dopamine release remains constant in inactive cortical areas while active cortical areas feel strongly the effect of increases and decreases in dopamine delivery due to increases and decreases in spike production along the parent dopamine axon.

38. We thank A. Damasio and T. Sejnowski for comments and criticisms, and C. Person for help in generating figures. The theoretical work received continuing support from the Center for Theoretical Neuroscience at Baylor College of Medicine and the National Institute of Mental Health (NIMH) (P.R.M.). P.D. was supported by Massachusetts Institute of Technology and the NIH. The primate studies were supported by the Swiss National Science Foundation, the McDonnell-Pew Foundation (Princeton), the Fyssen Foundation (Paris), the Fondation pour la Recherche Médicale (Paris), the United Parkinson Foundation (Chicago), the Roche Research Foundation (Basel), the NIMH (Bethesda), and the British Council.
