
Closure, recency, and activation-based syntactic memory

Ling792a
Feb 25, 2013

1 Where we are

Last class, we dissected two of Kimball’s parsing principles that together gave us what we called a forgetting procedure: a scheme by which syntactic structure that was no longer necessary was jettisoned from active memory during the process of constructing a parse. Today we are going to briefly review Kimball’s forgetting procedure, update assumptions about the timing of structural ‘forgetting’, and match the model against some empirical results and intuitive judgments of difficulty.

Next we will turn to more contemporary proposals for the character of syntactic memory: the Visibility Hypothesis of Frazier and Clifton (1998) and the ACT-R parser of Lewis and Vasishth (2005). The key theoretical innovation of these models for our purposes is the adoption of a notion of graded availability of information in syntactic memory.

1.0.1 Kimball’s Closure, and Processing

Previously, we assumed a model that parses according to a top-down node postulation strategy. One principle of Kimball’s system that we did not examine in detail was New Nodes, a principle that regulated the timing of node postulation:

(1) New Nodes: The construction of a new node is signaled by a grammatical function word.

In other words, Kimball suggests a slight departure from a strict top-down strategy. In positing the New Nodes principle, Kimball suggests that the top-down postulation of nodes is gated by function words in the bottom-up input. This idea is similar to a very popular parsing strategy known as left-corner parsing (see, e.g., Resnik (1992)). This parsing strategy remains in use in a number of contemporary parsing models, including the ACT-R parser we will see later today. As it will form the basis of the formal parsing models we will work with for the remainder of the course, it will be useful to briefly review a parse using a left-corner system.

Consider a toy grammar with the following rules:

(2) a. TP → DP VP
b. DP → Det NP
c. NP → N
d. VP → V
e. Det → the
f. N → dog
g. V → barks

Left-corner parsing schemes mix aspects of top-down and bottom-up strategies. Like top-down strategies, a node may be entered into the parse before it has been completed. Like bottom-up strategies, the parser requires some bottom-up input to generate structure. Left-corner strategies may be described informally in the following fashion: given a rule Y → x1 x2 ... xn, if x1 is entered into the parse, then the left-hand side of the rule, Y, may also be entered into the parse. Because the dominating node is projected from the left-most symbol on the right, this strategy is called left-corner. This left-corner projection is often (but not always) applied recursively.

Let’s see how this works.

(3) Input: w0 The w1 dog w2 barks w3.

At w0, we have seen no input. Because the left-corner parser does not project structure until triggered by bottom-up input, no parser action is taken.

At w1, the parser recognizes the. This is the left corner of the determiner rule, and it completes a determiner constituent. This newly-posited determiner constituent is then recognized as the left corner of the DP rule, leading to the postulation of a DP node. This DP node is then recognized as the left corner of the TP rule, triggering the postulation of a TP node. Therefore, the parser projects the following structure upon reaching the:


(4) [TP [DP [Det the]]]

We’ve projected all we can and we move to w2. This allows us to left-corner project, and complete, an NP. This NP is attached into the DP node:

(5) [TP [DP [Det the] [NP [N dog]]]]

And we’ve exhausted the parser actions we can take at this moment. We scan another word at w3: barks. This is recognized as a V, which is a completing left corner of VP, which is then attached into the tree:

(6) [TP [DP [Det the] [NP [N dog]]] [VP [V barks]]]

From here on out, we’ll assume that the parser inputs nodes into the parse in a fashion described by a left-corner parser.
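To make the projection step concrete, here is a minimal Python sketch of left-corner projection over the toy grammar in (2). The encoding (a table mapping each left-corner category to the parent it projects) is my own illustration, not Kimball’s or Resnik’s formulation, and it models only projection, not the attachment of completed nodes into predicted ones:

# Toy grammar from (2), indexed by left corner: category -> parent it projects.
RULES = {
    "Det": "DP",   # DP -> Det NP
    "N":   "NP",   # NP -> N
    "DP":  "TP",   # TP -> DP VP
    "V":   "VP",   # VP -> V
}
LEXICON = {"the": "Det", "dog": "N", "barks": "V"}

def project(node):
    """Apply left-corner projection recursively: a node that is the left
    corner of some rule licenses the postulation of that rule's parent."""
    cat, _children = node
    while cat in RULES:
        parent = RULES[cat]
        node = (parent, [node])  # the parent dominates its left corner
        cat = parent
    return node

for word in ["the", "dog", "barks"]:
    terminal = (LEXICON[word], [word])  # projection is gated by bottom-up input
    print(word, "->", project(terminal))
# 'the' projects Det -> DP -> TP, exactly the structure in (4); 'dog' stops at NP
# and 'barks' at VP, since those nodes are not themselves left corners of any rule.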

We ended last time by investigating the memory structure of the Kimball parser. We saw that the parser consisted of two memory stores: an active parsing unit and a push-down store. Nodes were flushed from active memory, and put in a less accessible ‘processing unit’ according to the principles of Closure and Processing:

(7) (Early) Closure: A phrase is closed as soon as possible, i.e., unless the next node parsed is an immediate constituent of the phrase.

(8) Processing: When a phrase is closed, it is pushed into a syntactic (possibly semantic) processing stage and cleared from short-term memory.

These two principles describe when a node is closed by the parser (as soon as possible!), and what happens when closure occurs. Processing dictates that when a node is closed, it is pushed down into a ‘processing unit’, or a secondary memory store.

Recall that the Kimball closure principle is rather overeager, which led Frazier (1978) to call this principle Early Closure. It is easy to see that if we adopt a left-corner parsing scheme, the grammar above, and Kimball’s Early Closure, then we quickly get ourselves into trouble:

(9) Step One: [TP [DP [Det the]]]
Step Two: [NP [N dog]]

Evidently, NP is not an immediate constituent of TP. On Kimball’s Early Closure, TP would be closed and removed from the tree, leaving us without structure to attach the VP to!

More generally, this reflects the problem of premature closure. This issue, broadly speaking, formed part of Frazier’s (1978) critique of Kimball’s Early Closure strategy. Obviously we must amend Kimball’s Early Closure to avoid this problem. One way to do this is to suppose that the parser can delay closure of a node X by checking the production rules for node X, and determining if X has obligatory daughters that have not yet been recognized.

Or, we could assume the closure strategy implied by Frazier’s (1978) Late Closure. Late Closure reflects a more conservative approach to determining node closure points, one that relies heavily on the bottom-up input.

(10) Late Closure: When possible, attach incoming material into the clause or phrase currently being processed.


Late Closure states that the parser should attempt to attach material into the current node (the lowest node at the right edge) whenever grammatically possible. This implies that a node X is kept open until a terminal node is reached that is not dominated by X. Call this the Hard Right Edge of Late Closure:

(11) Late Closure’s Hard Right Edge: A phrase XP is kept open until a right edge is found, i.e., until the next node parsed is not dominated by XP.

It is interesting to observe that when parsing with a context-free grammar, Hard Right Edge is a risk-free strategy. This is because any XP generated by a context-free grammar will span a contiguous substring of the input. If some input is recognized that is not dominated by some node XP, then it follows that XP is closed. Of course, it is widely assumed that natural languages are not strictly context-free, and so it is not generally the case that the set of terminal nodes dominated by some XP will span a contiguous substring in the input. Late Closure can’t mean ‘keep a node open until the grammar forces you to close it’, because in grammars that allow movement to the right (or otherwise discontinuous constituents), the grammar will never force you to do so.

This is the mess that movement makes of our nice underlying constituent structures. So we might observe, then, that Hard Right Edge, while perhaps a reasonable strategy, is not without risk.

Of course, there are other intermediate options for determining closure points. For example, one might adopt a more data-driven, probabilistic approach to determining ideal closure points (Yamada and Matsumoto (2003), Nivre and Scholz (2004)). We also hypothesize that prosodic cues help to determine closure points. We don’t need to settle this today, but we’ll want to remember that the decision to close a phrase will in part control the predictions of the parser for the current model. Let’s assume Late Closure with a Hard Right Edge for the time being.

1.1 Predictions of Closure vs. the Data

So far our model has just one process that makes structure less accessible: closure of nodes and a two-state parser memory. Let’s verify whether this is enough to fit the data we have seen so far.

Recall the results of Caplan (1972). He observed faster access to the probe word (e.g. oil) when it occurred in the most recent clause, and somewhat slower access when it occurred in the earlier clause:

(12) a. Now that the artists are working in oil, prints are rare.
b. Now that the artists are working fewer hours, oil prints are rare.

Recall that a similar pattern of results was seen in the Gernsbacher et al. (1989) study, albeit with somewhat different structures:

(13) a. As Lisa set up the tent, Tina gathered the kindling.
b. As Lisa and Tina set up the tent, Dave and Tom gathered the kindling.

Let’s look at Caplan’s sentences in (12). In the distant probe version, we can safely assume that the constituents that comprise the first adjunct clause will have been closed and moved to the processing unit store by the time the end of the sentence is reached. This predicts that syntactic information, including the lexical, terminal nodes, will be more difficult to access at this point. This is confirmed by the data.

But what about oil in (12-b)? First, note that on either Late Closure or Early Closure, the subject DP will have been recognized and closed as soon as T is recognized. This means that the constituent containing oil is in the PU by the time the end of the sentence is reached. Because the PU has a flat structure for both (12) examples, the constituent containing oil will be equally accessible. That is, it will look something like the following:

(14) Active: [TP … [TP DP [T are] …]]
PU: [DP (oil) prints], [VP working (fewer hours) / (in oil)]

That is, oil is in the same memory state for both distant and local conditions, at the point when the probe is presented. This result obtains for both Early Closure and Late Closure strategies. The same appears to be true of the Gernsbacher sentences, as both probe words are in subject position. Because of its specifier position, the subject is always closed and processed upon recognizing T on both Late and Early Closure strategies.

It appears that our parsing model is as yet empirically inadequate. Our model predicts no difference between the distant and local examples in the Caplan studies. The problem holds with the Gernsbacher examples with conjoined names as in (13-b):


[Figure 1: Data from Gernsbacher et al. (1989)]

The Gernsbacher et al. data indicate that the first conjunct is more available than is the second conjunct. On our current closure-based parsing model, we would need to ensure that the parser closes only the second conjunct in the &P, not the first, at the end of the sentence. This does not follow naturally from any of the purely configurational closure principles we have entertained until now.

We seem to be unable to fit the existing data, and we haven’t seen very much of that. This suggests that our parser’s memory store is too coarse to provide a good model of human syntactic memory. I believe this is because we have posited only two memory states that any given node may be in: it may be active, or suppressed. We may want to allow for finer gradations in the availability of syntactic information in the parser’s memory. This is the option that parsing theorists have generally pursued since Kimball. We will see there is good empirical motivation for this theoretical development.

1.2 Recency effects: Late Closure and Right Association

The data that suggest a more fine-grained notion of availability of information in memory were already evident in Kimball’s system, however. Let’s consider intuitive judgments about the role of recency in syntactic comprehension. Kimball presented a number of data points that indicate a strong recency preference in syntactic comprehension. Let’s review some of the relevant data. Consider the following ambiguous sentence (Kimball (1973), p. 24):

(15) a. Joe figured that Susan wanted to take the train to New York out.

In principle, the final out may be a verbal particle associated with either the lowest VP (take X out) or the highest VP (figure X out). Intuition suggests that the former is much easier to get than the latter. This recency preference holds even with unambiguous structures:

(16) a. I thought the request over.
b. *I thought the request of the astronomer who was trying at the same time to count the constellations on his toes without taking his shoes off over.

Note that I am using * here to indicate a degree of unacceptability typically associated with ungrammatical sentences, rather than using it to make a claim about a particular string being ungrammatical. One intuits that (16-b) remains unacceptable with any amount of exposure, despite the fact that the deviance is apparently a matter of processing difficulty.

Kimball also offers a case of rightward extraposition from DP as an example of recency preferences in parsing:

(17) a. The girl took the job that was attractive.
b. The girl went to NY that was attractive.

Intuition suggests that (17-a) is preferentially parsed with an in-situ RC that modifies the job, rather than the girl. However, the extraposed parse is not ungrammatical, as evidenced by (17-b). Instead, it appears that the configuration in (17-a) biases comprehenders towards the local interpretation of the RC.

As a final example, one observes a recency effect with the association of ambiguous temporal adverbials (Frazier (1978), p. 53):

(18) a. Joe said that Martha expected that it would rain yesterday.

Kimball initially supposed that these recency preferences might simply reflect Early Closure. Early Closure would have pushed the higher VPs into the suppressed memory pile:

(19) Active: [VP [V rain]]
PU: [VP [V said] CP], [VP [V expected] CP], …

This model of recency predicts two states of availability, something that currently appears empirically inadequate. We might also note that our Late Closure strategy precludes us from reducing recency effects to closure effects. In a long right branch of the sort shown in these examples, we never trigger the application of Closure, and so all of our dominating nodes are kept active!

(20) Parsed: [TP [DP Joe] [VP [V figured] [CP [C that] [TP [DP Susan] [VP [V wanted] [TP to take the train]]]]]]
Incoming: out

Our only model of structural forgetting, then, does not account for recency effects. Kimball’s solution was to posit a separate principle governing recency. This comes in the form of Kimball’s Right Association, which is similar to Late Closure:

(21) Right Association: Terminal nodes optimally associate to the lowest non-terminal node.

(22) Late Closure: When possible, attach incoming material into the (minimal) clause or phrase currently being processed.

As formulated, these are both constraints that cast recency as a binary phenomenon. Either you are recent (in the immediate open node), or not. We saw earlier that this binary system prevented us from capturing the probe data from Gernsbacher et al. As stated, it will not capture the Caplan data either.

1.2.1 n-way ambiguities from Frazier (1978), Church (1980), and Gibson et al. (1996)

The probe data we have seen cannot be accounted for with a simple, binary memory store and a closure principle. It is possible that with an appropriately formulated principle governing recency, we may be able to account for the recency judgments reported above, and the probe data. It is also possible that the recency principle might only admit two states of availability, recent and non-recent.

However, intuitive judgments and experimental data on syntactic attachment ambiguities with more than two possible attachment sites support more than two underlying states of availability of attachment sites. The first and most widely cited example of a three-way distinction in syntactic availability is the following sentence from Frazier (1978) and Kimball (1973):

(23) a. Joe said (V1) that Martha expected (V2) that it would rain (V3) yesterday.

Kimball observed that the VP headed by the most local verb, V3, is much preferred as the host for yesterday, and he further claimed that one observes a distinction in the preferences among the more distant attachment sites. His judgments indicate a preference for V2 over V1. However, Frazier (1978) indicates that her informants preferred a V1 attachment over a V2 attachment. This judgment corroborates my own intuition. This leads to the following three-way distinction in attachment preferences:

(24) a. Lowest VP > Highest VP > Middle VP

Gibson et al. (1996) investigated qualitatively similar ambiguities with RC attachment in Spanish and English. They presented English and Spanish speakers with complex DP fragments, and used a stops-making-sense task. Participants read the following fragments word by word, and hit a button if they perceived an ungrammaticality in the DP fragment.

(25) a. High attachment: Las lámparas cerca de la pintura de la casa que fueron dañadas en la inundación.
b. Mid attachment: La lámpara cerca de las pinturas de la casa que fueron dañadas en la inundación.
c. Low attachment: La lámpara cerca de la pintura de las casas que fueron dañadas en la inundación.

(26) a. High attachment: The lamps near the painting of the house that were damaged in the flood.
b. Mid attachment: The lamp near the paintings of the house that were damaged in the flood.
c. Low attachment: The lamp near the painting of the houses that were damaged in the flood.

For both English and Spanish speakers, Gibson et al. (1996) observed that participants rejected Low attachments at low rates starting at the RC region, at slightly higher rates for High attachment, and at even higher rates for the Mid attachment. This is the same non-monotonic pattern of recency preferences reported by Frazier (1978): Low > High > Mid.

[Figure 2: Data from Gibson et al. (1996)]

Church (1980) gives us a final set of data that tests the availability of structure in syntactic memory, using the application of Principle B (or rather, the detection of the repeated name penalty) and the attachment of rationale clauses as diagnostics.

(27) a. High attachment: Billy said that Susan denied that the kids told a lie to get himself out of hot water.
b. Mid attachment: Billy said that Susan denied that the kids told a lie to get herself out of hot water.
c. Low attachment: Billy said that Susan denied that the kids told a lie to get themselves out of hot water.

Church reports a monotonic ordering preference for Rationale Clause attachment, contra the RC attachment findings in Gibson et al. (1996) and adverbial modifier attachment in Frazier (1978). I believe I share this preference; I do not know if this particular paradigm has been experimentally verified. If RatCs form an exception to the non-monotonic preferences observed above, then this is a bit of a puzzle.

One final example comes from repeated name penalty judgments. Church hypothesizes that our ability to detect a repeated name violation falls off as the syntactic availability of the antecedent decreases. He provides the following paradigm:

(28) a. Recent Repeated Name: *John_i told the teacher that John_i threw the first punch.
b. Mid-distance Repeated Name: *?John_i told the teacher that Bill said that John_i threw the first punch.
c. Long-distance Repeated Name: ?John_i told the teacher that Bill said that Sam thought that John_i threw the first punch.

1.2.2 Visibility and Activation in Syntactic Memory

The empirical data thus far compel us to embellish our parsing model in two ways.

(29) a. One: a range of empirical data suggests that we require more than a binary distinction in parser memory. It appears that a graded notion of syntactic availability is called for.
b. Two: it appears that closure is not the only mechanism by which structure is ‘forgotten.’ We have observed differences in the availability of syntactic structure amongst nodes that should, by hypothesis, have the same closure status.

This state of affairs led to two related developments in parsing theory. The first is the widespread adoption of a graded notion of availability in parser memory. The formulation of the Visibility Hypothesis is one instantiation of this idea, but similar ideas are shared very broadly among parsing theorists.

(30) Visibility Hypothesis: In first analysis and reanalysis, attachment to a visible node is less costly in terms of processing / attentional resources than attachment to a less visible node. (i) Node X is more visible than node Y if X was postulated later than Y. (ii) Nodes within a perceptually-given package (e.g. intermediate phonological phrase) are more visible than nodes outside the package. (Frazier and Clifton (1998))

The Visibility Hypothesis is a principle that regulates the availability of syntactic information in parser memory. It makes specific claims about what controls ‘visibility’, and we won’t have anything to say about these claims just yet. For the current purposes, the crucial theoretical innovation is the adoption of a real-valued notion of visibility or availability of syntactic information in parser memory. With the further stipulation that visibility is controlled by recency, a number of interesting and (apparently) empirically correct generalizations about parsing preferences follow: we prefer recent attachments, and in certain cases, the preference to attach locally may in fact override a more distant, but structurally more complex parse.


This theoretical shift happened almost effortlessly, because the notion of a graded availability of syntactic information is (apparently!) very easy to understand as a reflection of continuously varying ‘strength’ in memory. This shift in our thinking about syntactic memory raises the possibility that closure processes do not in fact play any role in how syntactic structure is forgotten. My understanding is that this possibility has been implicitly endorsed by most parsing theorists. Now let’s build an explicit model that formalizes a related notion.

1.3 Syntactic comprehension as skilled working memory retrieval

The Visibility Hypothesis, and other activation-based models of syntactic comprehension, are motivated by working memory considerations. However, this raises the question of whether these principles are independent theoretical entities that are functionally grounded in working memory principles, or whether it is possible to in fact view recency effects as a direct reflection of independently-motivated working memory processes. To see how we might do this, let’s translate Visibility into the language of memory access that we’ve been using up to now. Let’s assume that less ‘visible’ nodes are those nodes that are less accessible in the parser’s syntactic memory store. Let’s further assume that in order to attach terminal node X into a previously posited node Y, we need to reaccess Y from parser memory. In this sense, attachment into a given node requires our parser to locate that node in parser memory. Let’s call this a ‘retrieval’ operation: we are querying a syntactic representation and attempting to find a landing site. Once that landing site is found and retrieved, assume that we can attach X into Y with a processing cost that is invariant across choice of X and Y.

On this view, the recency effects captured by visibility still reflect some graded availability of syntactic information; we are just calling it activation now. Further, less active or less visible nodes do not impede the process of attachment; rather, they impede the process of accessing a node, which is required for attachment into that node. We may then recast Visibility in terms of memory access in the parser:

(31) Visibility-as-activation-in-retrieval Hypothesis: In first analysis and reanalysis, retrieval of a node from parser memory feeds attachment processes. Retrieval of an active node is less costly in terms of processing / attentional resources than retrieval of a less active node. (i) Node X is more active than node Y if X was postulated later than Y. (ii) Nodes within a perceptually-given package (e.g. intermediate phonological phrase) are more active than nodes outside the package. (Frazier and Clifton (1998))

The visibility-as-activation-in-retrieval hypothesis motivates the model that Lewis and Vasishth (2005) develop. Their model is a computationally explicit implementation of an activation-based parser that stakes out a highly reductionist view of recency effects: they are a reflection of retrieval processes at work in the parser. Furthermore, Lewis and Vasishth note that we have heavily researched, independently motivated theories of how memory access works in various cognitive domains. In their model, they seek to determine whether the recency effects we’ve observed might in fact be reduced to independently-motivated working memory processes operating over domain-specific parsing representations.

Put differently, on the ACT-R parsing model they develop, the reason recency effects look very similar to effects seen in domain-general working memory (i.e. continuous activation, recency effects, primacy effects) is that they are the effects seen in domain-general working memory tasks. Their core claim is that syntactic parsing relies on domain-general memory mechanisms over domain-specific representations, and that the recency effects we observe reflect this. Today we’ll just work through the guts of the model and try to understand how it works.

1.3.1 Types of memory in ACT-R

As we try to understand the Lewis and Vasishth model of parsing, we first want to look at the types of memory and knowledge structures allowed in the ACT-R cognitive architecture. ACT-R seeks to provide a unified theory of cognition, by providing a handful of theoretical primitives from which all cognitive operations are implemented. Critically for our purposes here, ACT-R divides knowledge into two categories with different properties.

Declarative Memory: Declarative memory consists of information that is ‘written down’ in an explicit representation in some memory store. Often this is thought of as things we are consciously aware of. For instance, it is natural to think of the lexicon as residing in a sort of declarative memory (this is sometimes referred to by memory researchers as ‘semantic knowledge’). ACT-R makes the following formal claim about all declarative memory systems across cognition: they consist of chunks, which are typed bundles of feature-values that are defined over a domain-specific representational vocabulary.

(32) [concept: robin, is-a: bird, has-color: red-breasted]


[cat: det, head: those, num: pl, gender: −, case: −]

The feature types and licit values are assumed to be domain-specific. The representation of those, for instance, gives an ACT-R representation of a terminal node those in a vocabulary of linguistic features: category features and ϕ-features associated with D.

Procedural Memory: In addition to the declarative representation of chunks in memory, ACT-R models of cognition may represent knowledge procedurally. Informally, this is knowledge that is taken to not be consciously available; people often refer to procedural knowledge as their ‘muscle memory.’ In ACT-R, procedural knowledge is formally represented in the form of a production rule. Production rules are deductive statements about what operations to make based on the current state of the cognitive modules. A production rule consists of an if-then conditional statement: the antecedent specifies the conditions that must hold for some operation, and the consequent specifies the action that the cognitive system must perform.

(33) Production Rule format: if condition A holds then perform action B

Example: Left-Corner Parsing. These two ways of representing knowledge give an ACT-R cognitive model a limited amount of flexibility in modeling cognition. Let’s take an example of a single parser action in our current model, the left-corner projection of a node:

(34) [Det the] → [DP [Det the]]

In ACT-R, one natural way of modeling this is by representing the knowledge necessary to perform this action procedurally, in a production rule.

(35) a. Production 1: if cat = Det then create DP node AND attach Det as head.

With this single production rule, we would change the state of our declarative memory store:

(36) [cat: det, head: those, num: pl, gender: −, case: −] → [cat: DP, head: Det, num: pl, gender: −, case: −]

This is exactly how the Lewis and Vasishth model captures the skill and grammatical sophistication of parsing. Their major claim is that the parse structure is represented node by node in declarative memory, and that grammatical knowledge is represented only procedurally in the form of production rules.

Thus we get to the first two claims about the use of working memory in the parser (from Lewis and Vasishth (2005), p. 391):

(37) a. Declarative memory for long-term lexical and novel linguistic structure. Both the intermediate structures built during sentence processing and long-term lexical content are represented in declarative form by chunks, which are bundles of feature-value pairs.
b. Efficient parsing skill in a procedural memory of production rules. A large set of highly specific production rules constitute the skill of parsing and a compiled form of grammatical knowledge. The parsing algorithm is best described as a LC algorithm, with a mix of bottom-up and top-down control. Sentence processing consists of a series of memory retrievals guided by the production rules realizing this parsing algorithm.

Learning in ACT-R consists, in part, of a transfer of knowledge from declarative memory to procedural memory as a function of practice and correlation between production rules and the declarative representations they operate over. In more familiar terms, this is a process of extracting ‘rules’ from the input.


1.3.2 The character of declarative memory in ACT-R

The declarative memory component of ACT-R assumes a flat structure, unlike the highly specialized parser memory system we assumed in the Kimball parser. Declarative memory in their system serves a number of distinct functions: it is both the lexicon and the short-term working memory store. Although there is no structural distinction between these two memory stores, long-term lexical material and short-term structure may be distinguished on the basis of their feature content.

The content of declarative memory is an unordered list of representational chunks:

(38) [cat: det, head: those, num: pl, gender: −, case: −]

What may be a ‘chunk’? Lewis and Vasishth admit this is potentially a major theoretical weakness in the proposal, if the modeler is allowed to arbitrarily define chunks to suit the task at hand. However, in the context of syntactic parsing, this notion is thankfully rather circumscribed. A chunk is defined as the minimal element that may enter into novel relations with other chunks. If sentence processing is ‘principally a task of composing novel combinatorial representations’, then we are left with the assumption that chunks in the parser’s declarative memory correspond to nodes in the parse.

One feature of the ACT-R declarative memory is that chunks in declarative memory have a fluctuating activation level associated with them. A node is available to the extent that it is active in the local processing context. The activation of chunk i is described by the following equation:

A_i = B_i + Σ_j W_j S_ji    (1)

The first thing to note about this equation is that it has two terms, B_i and Σ_j W_j S_ji. The former refers to the ‘base activation’ and the latter to the strength of the associations between chunk i and the contents of the retrieval cues (more on this in a moment).

These two terms give us some clues to where this equation comes from. The functional form of this equation is derived from what is known as rational analysis. Rational analysis is a way of approaching a problem that asks ‘what is the optimal way to solve my problem, freed of any resource limitations?’ ACT-R’s ‘rational’ nature stems in part from this equation that describes memory access. Anderson and Schooler (1991) supposed that the optimal behavior for some memory system would be to always return the representation that is the most likely in some context:

argmax_i P(R_i | Context)    (2)

P(R_i | Context) ∝ P(Context | R_i) P(R_i)    (3)

Anderson hypothesized a particular functional form for both the prior and the likelihood in this equation, and then further supposed that the activation level of representations in declarative memory is monotonically related to the posterior probability that it will need to be retrieved in the local context. The activation equation in ACT-R is essentially the logarithm of this little bit of Bayesian number crunching. Bayesian decision theory tells us that choosing the memory with the highest posterior probability ensures optimal performance of the system.
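Taking the logarithm of (3) makes the correspondence with equation (1) explicit; as a sketch, dropping the normalizing constant:

log P(R_i | Context) = log P(R_i) + log P(Context | R_i) + const.

Here the first term plays the role of the base activation B_i, and the second that of the associative term Σ_j W_j S_ji.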

The Bayesian view of this activation also helps us break it down into two subparts: the prior and the likelihood.

The prior: resting activation. B_i contributes the resting level of activation of a given chunk / node. This is analogous to the prior probability in Bayesian models, and represents how likely a node is to be used in the null context.

B_i = ln( Σ_{j=1}^{n} t_j^(−d) )    (4)

t_j is the time (in milliseconds) since the jth retrieval of chunk i. d is a free ACT-R parameter, though the fitted value of this parameter is remarkably stable across almost all ACT-R models, as Lewis and Vasishth note. It is 0.5, and it yields the following decay function for a given chunk’s activation.


[Figure: the ACT-R decay function; a chunk’s activation as a function of time (ms), 0–1000 ms]
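Equation (4) is easy to explore numerically. A minimal sketch, assuming d = 0.5 and a single hypothetical retrieval (the chunk’s creation) at time zero:

import math

def base_activation(retrieval_times_ms, now_ms, d=0.5):
    """Equation (4): B_i = ln(sum_j t_j^(-d)), where t_j is the time elapsed
    since the j-th prior retrieval of chunk i."""
    return math.log(sum((now_ms - t) ** (-d) for t in retrieval_times_ms))

for now in (100, 400, 1000):
    print(now, round(base_activation([0], now), 2))
# 100 -> -2.3, 400 -> -3.0, 1000 -> -3.45: activation falls off as a power of
# elapsed time, and each further retrieval adds a fresh, slowly-decaying term.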

The likelihood: association between local context and memory chunks. Σ_j W_j S_ji represents an associative match of a set of retrieval cues to the feature values associated with chunk i. Retrieval cues are a list of feature values that are used to access the declarative memory in ACT-R. Declarative memory in ACT-R is content-addressable; memories are accessed in the system not according to where they are, but according to what they are. Memory access proceeds by defining a set of retrieval cues, and then matching these cues against the contents of memory. The association between the retrieval cues and the contents of chunks in memory gives an activation boost defined by this equation, and this activation boost leads to a chunk being accessed. An example:

Contents of Declarative Memory:

[cat: DP, head: Det, num: pl, gender: −, case: nom]
[cat: DP, head: Det, num: sing, gender: −, case: acc]

Retrieval Cues:

[num: pl, case: nom]

Given these parameters, we evaluate the value of Σ_j W_j S_ji. W_j is the weight given to the jth feature of the retrieval cue set. This is not a free parameter in the model; all features are taken to have equal weighting, and W_j is set to 1 divided by the number of features in the retrieval cue set. S_ji is the associative strength afforded to a cue. It measures how ‘distinctive’ a particular retrieval cue is:

S_ji = S − ln(fan_j)    (5)

where the ‘fan’ is the number of items associated with a cue j. The higher the fan, the greater the retrieval interference, and the lower the value S_ji takes on. S is the maximum associative strength, a free parameter in the model that is generally fixed across ACT-R models.

In our example above, we note that the feature values in the retrieval cues have a fan of 1 each; there is only one chunk in memory with the plural value for the number feature, and one with the nominative value for the case feature. Because a fan of 1 translates to zero retrieval interference (under the natural logarithm), there is no retrieval interference. Thus the target DP receives an activation boost of (0.5*1.5 + 0.5*1.5) = 1.5.

Now consider what happens if we change the contents of declarative memory, so that both chunks now share the same number feature:

[cat: DP, head: Det, num: pl, gender: −, case: nom]


[cat: DP, head: Det, num: pl, gender: −, case: acc]

Now the fan of the number retrieval cue is 2. This increased fan of the plural cue leads to retrieval interference; its utility as a retrieval cue is decreased, and this is reflected in a drop in the associative activation given to this node in this configuration: for case the associative boost remains interference-free at 0.5*1.5, but for number it engenders retrieval interference, 0.5*0.8, leading to an activation boost of 1.15.

In this way, the existence of multiple chunks with similar feature values leads to similarity-based retrieval interference.
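Both scenarios can be checked with a short script implementing the associative sum, assuming a maximum associative strength S = 1.5 (the value the worked numbers above imply):

import math

S_MAX = 1.5  # maximum associative strength S; a free parameter, 1.5 fits the example

def associative_boost(cues, memory):
    """Sum_j W_j * S_ji for the chunk matching every cue, with W_j = 1/|cues|
    and S_ji = S - ln(fan_j), where fan_j counts the chunks matching cue j."""
    w = 1.0 / len(cues)
    boost = 0.0
    for feature, value in cues.items():
        fan = sum(1 for chunk in memory if chunk.get(feature) == value)
        boost += w * (S_MAX - math.log(fan))
    return boost

cues = {"num": "pl", "case": "nom"}
mem1 = [{"num": "pl", "case": "nom"}, {"num": "sing", "case": "acc"}]
mem2 = [{"num": "pl", "case": "nom"}, {"num": "pl", "case": "acc"}]
print(round(associative_boost(cues, mem1), 2))  # 1.5: both cues have fan 1
print(round(associative_boost(cues, mem2), 2))  # 1.15: the plural cue now has fan 2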

A final step in specifying the declarative memory structure of ACT-R is to map chunk activations onto memory retrieval processes. First, a winner-take-all process is assumed: the chunk with the highest activation is retrieved. The time course of the retrieval of a chunk is given by the following equation:

T_i = F e^(−A_i)    (6)
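A final sketch combines winner-take-all selection with equation (6); F is a free latency-scaling parameter, and the value used here is only illustrative:

import math

F = 0.2  # latency scaling factor, in seconds (illustrative value, not a fitted one)

def retrieve(chunks_with_activation):
    """Winner-take-all retrieval: the most active chunk wins, and its
    retrieval latency is T_i = F * exp(-A_i), per equation (6)."""
    chunk, a_i = max(chunks_with_activation, key=lambda pair: pair[1])
    return chunk, F * math.exp(-a_i)

winner, latency = retrieve([({"cat": "DP"}, 1.5), ({"cat": "VP"}, 0.4)])
print(winner, round(latency, 3))  # the DP wins; higher activation, faster retrieval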

Phew! This specifies an associative memory structure for the parser. Lewis and Vasishth’s model has satisfied our desire to have an activation-based syntactic memory in the parser. It has done so by adopting the extreme hypothesis that there is only a single declarative memory store used by long-term representations and short-term parsing representations alike, removing all temporary memory structures such as a stack or a chart. We have now seen four of the five critical theoretical claims advanced by Lewis and Vasishth:

(39) a. Declarative memory for long-term lexical and novel linguistic structure. Both the intermediate structures built during sentence processing and long-term lexical content are represented in declarative form by chunks, which are bundles of feature-value pairs.
b. Efficient parsing skill in a procedural memory of production rules. A large set of highly specific production rules constitute the skill of parsing and a compiled form of grammatical knowledge. The parsing algorithm is best described as a LC algorithm, with a mix of bottom-up and top-down control. Sentence processing consists of a series of memory retrievals guided by the production rules realizing this parsing algorithm.
c. Activation fluctuation as a function of usage and decay. Chunks have numeric activation values that fluctuate over time; activation reflects usage history and time-based decay. The activation affects their probability and latency of retrieval.
d. Associative retrieval subject to interference. Chunks are retrieved by a content-addressed, associative retrieval process. Similarity-based retrieval interference arises as a function of retrieval cue overlap: the effectiveness of a cue is reduced as the number of items associated with the cue increases.

1.4 Tiny buffers

The last core theoretical assumption made by the Lewis and Vasishth parser is that there are a limited number of memory buffers whose contents are limited to (at most) a single chunk of information:

The three buffers we need to focus on are i) the lexical access buffer, which stores the chunk retrieved from the lexicon for a particular phonological form, ii) the control or ‘goal’ buffer, where an expectation for an upcoming constituent is maintained in the form of retrieval cues, and iii) the retrieval buffer, which stores the contents of any working memory retrievals that are engaged. Production rules may reference the contents of two buffers simultaneously, without retrieval.

The limited capacity of the buffers means that the parser may only actively maintain two syntactic nodes at any given time for purposes of composition. Because of this limited capacity, the role for retrieval processes now becomes clear: retrieval from declarative memory is how a syntactic node is moved from passive memory to an active buffer for processing. Any time we wish to restore a node to active memory for further processing, we must engage a working memory retrieval.

(40) a. Declarative memory for long-term lexical and novel linguistic structure. Both the intermediate structures built during sentence processing and long-term lexical content are represented in declarative form by chunks, which are bundles of feature-value pairs.
b. Efficient parsing skill in a procedural memory of production rules. A large set of highly specific production rules constitute the skill of parsing and a compiled form of grammatical knowledge. The parsing algorithm is best described as a LC algorithm, with a mix of bottom-up and top-down control. Sentence processing consists of a series of memory retrievals guided by the production rules realizing this parsing algorithm.
c. Activation fluctuation as a function of usage and decay. Chunks have numeric activation values that fluctuate over time; activation reflects usage history and time-based decay. The activation affects their probability and latency of retrieval.
d. Associative retrieval subject to interference. Chunks are retrieved by a content-addressed, associative retrieval process. Similarity-based retrieval interference arises as a function of retrieval cue overlap: the effectiveness of a cue is reduced as the number of items associated with the cue increases.
e. Focused buffers holding single chunks. There are an architecturally fixed set of buffers, each of which holds a single chunk in a distinguished state that makes it available for processing. Items outside of the buffers must be retrieved to be processed.

1.5 Putting it together: LC parsing with an associative memory store

Let’s chug through part of a parse of ‘The man walks’. Before reading any words, we apply a production rule to create a TP constituent. Completion of the TP is the goal of parsing. It is underspecified with respect to most of its features; we just know we want a TP at the end of the day. This is the start symbol, and we let our parser start here.

Declarative Memory:

[cat: TP, spec: −, comp: −, head: −, num: −, tense: −]

Goal Buffer: −
Lexical Access Buffer: −
Retrieval Buffer: −

Next we hear ‘the’, perform lexical access, and have a lexical entry chunk residing in the lexical access buffer:

Declarative Memory:

[cat: TP, spec: −, comp: −, head: T, num: −, tense: −]

Goal Buffer: −

Lexical Access Buffer:

[cat: Det, head: the, num: −, gender: −]

Retrieval Buffer: −


Recall that our production rules are deductive statements that specify parser actions to take when certain conditions hold. Currently we hold a Det in the lexical access buffer, and the control state of our parser is expecting a TP constituent. This set of conditions on a left-corner parsing scheme allows us to infer that the parser must construct this structure:

(41) [TP [DP [Det the]]]

In order to do this, the first step is to retrieve the TP node, and restore it to the retrieval buffer. This retrieval is associative and cue-based. The first production rule sets the retrieval cues for a TP in the goal buffer:

Declarative Memory:

[cat: TP, spec: −, comp: −, head: T, num: −, tense: −]

Goal Buffer:

[cat: TP]

Lexical Access Buffer:

[cat: Det, head: the, num: −, gender: −]

Retrieval Buffer: −

With the single retrieval cue for a TP lined up in the goal buffer, an associative memory retrieval is executed. The time and success of the retrieval is described by the equations above. The resulting configuration is:

Declarative Memory: −

Goal Buffer:

[cat: TP]

Lexical Access Buffer:

[cat: Det, head: the, num: −, gender: −]

Retrieval Buffer:

[cat: TP, spec: −, comp: −, head: T, num: −, tense: −]

Now we have a TP node in our retrieval buffer, available for processing, and a determiner in our lexical buffer. Again, the parser’s actions in this context are clear on a left-corner algorithm: we must project a DP node, and attach that DP node in the spec of TP. This is accomplished in a single production rule, leaving us with this parser state:

Declarative Memory:

[cat: TP, spec: DP, comp: −, head: T, num: −, tense: −]


[cat: DP, spec: −, comp: −, head: the, num: −, tense: −]

Goal Buffer: −
Lexical Access Buffer: −
Retrieval Buffer: −

At this point we have moved information about the various nodes through our specialized buffers, and constructed the left-corner projection from the Det node. We may proceed through the parse in a similar fashion, constructing nodes according to a left-corner strategy, retrieving nodes when attachment is called for, and moving through the parse. At the end, we are left with a declarative representation of the parse: one chunk per node, linked together by features like spec, comp, and head.
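The buffer traffic in this trace can be summarized schematically in a few lines. This is a sketch of the control flow only (my own simplification; in the full model each retrieval is governed by the timing equations above):

# Schematic of the first left-corner step of 'The man walks' as buffer traffic.
declarative = [{"cat": "TP", "spec": "-", "head": "T"}]  # the start-symbol chunk
goal = lexical = retrieval = None

# Lexical access: hearing 'the' places a Det chunk in the lexical buffer.
lexical = {"cat": "Det", "head": "the"}

# Production: a Det in the lexical buffer while a TP is expected
# writes TP retrieval cues into the goal buffer.
goal = {"cat": "TP"}

# Cue-based retrieval moves the matching chunk from declarative memory
# into the retrieval buffer.
retrieval = next(c for c in declarative if c["cat"] == goal["cat"])
declarative.remove(retrieval)

# Production: project a DP from the Det, attach it as spec of TP,
# and write both nodes back to declarative memory.
dp = {"cat": "DP", "head": lexical["head"]}
retrieval["spec"] = "DP"
declarative += [retrieval, dp]
goal = lexical = retrieval = None
print(declarative)  # two chunks remain: the updated TP and the new DP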

1.6 Major claims on recency

Let’s zoom out for a bit. The ACT-R parser relies on memory retrieval to mediate attachment decisions. Memory retrieval operates using the same sorts of declarative memory argued for in other domains: it is subject to recency effects and similarity-based interference.

As in Visibility, Recency in this model is entirely controlled by temporal decay of representations in a memory store. It assumes a particular quantitative form of the decay function, and we may compare this with experimental data and see how well the model captures the empirical facts. Consider the following paradigm from McElree et al. (2003):

(42) a. The book ripped.
b. The book that the editor admired ripped.
c. The book from the prestigious press that the editor admired ripped.
d. The book that the editor who quit the journal admired ripped.

McElree and colleagues used the speed-accuracy tradeoff (SAT) procedure to estimate the time necessary to retrieve the subject DP the book upon reaching the verb ripped. They observed two regimes of speed, near and far.

To a very close approximation, the ACT-R decay function predicts the pattern of data seen in McElree’s experiment. Note that these predictions do not just reflect pure recency effects. They also reflect similarity-based interference from the presence of multiple subjects in these examples.

To see how similarity-based interference plays a role in the retrieval processes assumed by the model, consider the widely-studied contrast between subject and object relatives in English (King and Just (1991); Gibson (1998)):

(43) a. The reporter who sent the photographer to the editor hoped for a story.
b. The reporter who the photographer sent to the editor hoped for a story.

The model predicts difficulty associated with retrieving an argument (subject or object) upon reaching either the embedded or the main verb.

This pattern arises as an interaction of similarity-based interference and recency effects in syntactic memory. In subject relative clauses, the relative pronoun is more recent, and more distinct as a subject, than in the object relative clause case. This causes more difficult retrieval of the relative pronoun at the verb sent in the OR case than in the SR case, which predicts the pattern of difficulty observed in e.g. the King and Just experiments.

1.7 Zooming out

Let’s zoom out again. The ACT-R model provides us with one concrete implementation of a model of activation-based, associative syntactic memory. For our purposes, this model highlights several features of syntactic memory:

(44) a. Nodes in a parse tree appear to be associated with a continuously varying activation or visibility that determines how difficult it is to retrieve that node.
b. Activation decays as a power function of time, per equation (4).
c. Activation of a node may be boosted by reactivating that node.
d. Activation may be reduced due to similarity-based interference.

This is not intended to be an exhaustive list of what modulates activation in syntactic memory, but it gives us a framework for thinking about what factors might be important in gating the availability of nodes in a syntactic parse tree. When next we meet, we’ll look at some further fits of this model to experimental data.

The last thing for us to note is that we have now staked out an extremely different position than the one we started with. We have no more specialized parser memory structures, such as a stack of constituents. All we have is a flat declarative memory, and fluctuating activation within that memory store. We have also completely eliminated closure principles in favor of a flat, time-based activation function.


References

Caplan, D. (1972). Clause boundaries and recognition latencies for words in sentences. Attention, Perception & Psychophysics, 12:73–76.

Church, K. W. (1980). On memory limitations in natural language processing. Master’s thesis, Massachusetts Institute of Technology.

Frazier, L. (1978). On comprehending sentences: Syntactic parsing strategies. PhD thesis, University of Connecticut.

Frazier, L. and Clifton, C. (1998). Sentence reanalysis, and visibility. In Reanalysis in Sentence Processing. Kluwer Academic Publishers.

Gernsbacher, M. A., Hargreaves, D. J., and Beeman, M. (1989). Building and accessing clausal representations: The advantage of first mention versus the advantage of clause recency. Journal of Memory and Language, 28:735–755.

Gibson, E. (1998). Linguistic complexity: Locality of syntactic dependencies. Cognition, 68:1–76.

Gibson, E., Pearlmutter, N., Canseco-Gonzalez, E., and Hickok, G. (1996). Recency preference in the human sentence processing mechanism. Cognition, 59:23–59.


Kimball, J. (1973). Seven principles of surface structure parsing in natural language. Cognition, 2(1):15–47.

Lewis, R. and Vasishth, S. (2005). An activation-based model of sentence processing as skilled memory retrieval. Cognitive Science, 29:375–419.

McElree, B., Foraker, S., and Dyer, L. (2003). Memory structures that subserve sentence comprehension. Journal of Memory and Language, 48:67–91.

Nivre, J. and Scholz, M. (2004). Deterministic dependency parsing of English text. In COLING ’04: Proceedings of the 20th International Conference on Computational Linguistics.

Resnik, P. (1992). Left-corner parsing and psychological plausibility. In Proceedings of the Fourteenth International Conference on Computational Linguistics, Nantes, France.

Yamada, H. and Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. In Proceedings of IWPT.
