A Model of Spatial Indexing
The Role of Location Indexes in Spatial Perception: A Sketch of the FINST Spatial-index Model
Zenon Pylyshyn
Center for Cognitive Science, University of Western Ontario
Introduction
Marr (1982) may have been one of the first vision researchers to insist that in modeling
vision it is important to separate the location of visual features from their type. He argued that in
early stages of visual processing there must be “place tokens” that enable subsequent stages of
the visual system to treat locations independent of what specific feature type was at that location.
Thus, in certain respects a collinear array of diverse features could still be perceived as a line,
and under certain conditions could function as such in perceptual phenomena like the Poggendorff
illusion.
The idea that locations and feature-types are encoded independently is not a new one. A
closely related distinction was widely acknowledged in the literature on list-learning and letter-
recognition, where it has long been known that item information could be encoded or retained
independent of order information (e.g., Estes, Allmeyer & Reder, 1976; Coles, Gratton, Bashore,
Eriksen & Donchin, 1985). The view also has considerable support from neurophysiology,
where evidence has been accumulating for two visual pathways, one specialized for location and
the other for identification (Mishkin, Ungerleider & Macko, 1983). Recently, this idea has
received additional support from the finding that “conjunction illusions” occur under certain
conditions (Treisman & Gelade, 1980). Conjunction illusions are visual illusions in which one
property of a feature (e.g., colour) is mistakenly conjoined with another property (e.g., shape)
which is also present in the stimulus, though at another location. For example, a display
consisting of a green X and a red O may be reported as consisting of a red X and a green O. If
the locations and types of such features were inextricably linked in the internal perceptual
encoding (as they are in any pictorial representation), such illusions would not occur.
This sort of exchange of conjuncts has led to the postulation of independent “feature maps”
with only weak cross-bindings. Another, closely related way to view this phenomenon is in
terms of the dissociation of feature-type and feature-location. Then the cross-talk results from
the failure to bind two properties of the same feature. Feldman and Ballard (1982) view focused
attention as the mechanism typically used to fix the location of a visual object and thereby allow
independent feature-properties associated with that object to be bound together. The importance
of focused attention for feature conjoining is also stressed by Treisman & Gelade (1980). Two
empirical observations implicate the importance of directed, focused attention in the encoding of
feature location: (1) conjunction illusions only occur when attention is shared with another task,
and (2) time to search for conjunctions of features is linear in the number of objects in the search
set, whereas search for a single feature is independent of the number of objects in the display (i.e.
visually primitive features exhibit “pop-out”).
There is, however, an unsatisfactory aspect of the view that in order to encode the location
of a feature one must focus attention on that feature. Visual attention, according to the widely held view,1 is unitary: it can only be directed to one place, or at least to only one local region at a
time and must be scanned from place to place in order to examine several places. Yet we clearly
can analyze patterns distributed over many places. In fact, in Marr’s example, the detection of a
pattern like collinearity of features requires that in some sense the location of all the features in
the set be available at one time, so that the collinearity of those features, rather than some others,
could be ascertained. In other words, in evaluating the predicate COLLINEAR(x1, x2, …, xn) the
arguments x1, x2, …, xn must in some way be bound to specific tokens of the relevant visual
features so that the evaluation takes place with respect to those very features in the scene.
Of course, there are many ways in which the arguments of such a visual predicate might
refer to (or be bound to) the locations of features. The locations might, for example, be explicitly
encoded as Cartesian coordinates and the codes for these coordinates associated with the
arguments of the predicate, in which case the evaluation might proceed by checking that this
———————
1. There has been some discussion in the literature concerning how broad a region can be covered within focal attention and whether the scope of this focal region can be varied (e.g., Eriksen & St. James, 1986). Nonetheless, there is general agreement that there is only one region of focal attention, as opposed to several independent and noncontiguous regions (see, also, the discussions in Hoffman & Nelson, 1981; Laberge, 1983).
array of coordinates forms a linear sequence. But this proposal seems implausible for a number
of reasons: (a) encoding of feature locations in terms of their (x,y) retinal coordinates would
presumably require that the relevant objects be first scanned sequentially and stored (on the
assumption that encoding such properties as coordinates, like the encoding of conjunctions of
features, requires focused attention), (b) the (x,y) coordinate seems too precise an encoding,
especially where larger features are involved, and (c) if the retinal coordinates were the basis for
location encoding it would be difficult to use them to detect patterns among moving elements. In
what follows, an alternative mechanism will be proposed for giving the cognitive system access
to places in the visual field at which some visual features are located, without assuming an
explicit encoding of the locations within some coordinate system or an encoding of the feature
type.
One of the assumptions of the present approach is that there is a pre-attentive or pre-
cognitive mechanism in the visual system for individuating features (or making particular feature
tokens conceptually distinct from the others), and for indexing their locations within the visual
field. The terms pre-attentive or pre-cognitive are used here in order to emphasize that the
hypothesized indexing process is an extremely primitive one that precedes such operations as the
recognition of patterns or the encoding of the relative locations of visual features. The basic idea
is that there is something the visual system must do before it can even begin to discern a spatial
pattern or spatial relations among component features in a display: it must “pick out” or as we
prefer to put it, “individuate” the features among which it will recognize some spatial relations,
such as the relations “above”, “part of”, “inside” and so on. Before you can determine that “this”
is inside “that” you must have a way to, in effect, “point to” the two features to which the
“inside” relation will apply.
In order to accomplish this “pointing” there is no need to have first recognized what feature-
types are being pointed to. All one needs is a way to pick out or index the locations of the
feature-tokens in question. The simple idea that individuation precedes explicit encoding leads
to one of the basic postulates of the present work (the notion of FINSTs), which in turn provides
some of the tools needed to illuminate a number of other puzzles, including some phenomena
involving visual imagery. It also provides the basis for some very preliminary steps towards a
computational theory of perceptual-motor coordination.
Indexing the location of visual features using FINSTs
To help characterize the idea behind the FINST mechanism in a concrete way, imagine the
following. Suppose you place each of your fingers on a different object (or feature-token) in a
scene. Now imagine that the objects are moving about or that you are changing your position
while your fingers keep in contact with the objects. Even if you do not know anything at all
about what is located at the places that your fingers are touching, you are still in a position to
determine such things as whether the object that finger number 1 is touching is to the left of or
above the object that finger number 2 is touching, or whether the object that finger number 3 is
touching is larger than the object that finger number 4 is touching. Of course, you may not be
able to determine this directly without further analysis (e.g. by haptic exploration), but your
finger-contact gives you a way to, in effect, refer to the objects so that some further processing of
them can be undertaken. You do not first have to search for an object that meets some particular
description, because you have a direct way to locate or index relevant scene-tokens for further
processing. Moreover, the access that the finger-contact gives makes it inherently possible to
track a particular token – i.e., to keep referring to what is, in virtue of its historical trace, the
same object, independent of its location in space. In this way you can individuate an object, keep
it conceptually distinct from other objects, and continue to do so as it moves about.
Such a direct mechanical indexing makes it possible to do something that cannot be done
directly in vision. Touch provides a way of indicating to oneself, and therefore of thinking,
“this” object or “that” object, for any object being touched, independent of its location in space.
The parallel case in vision appears to be different, since it would seem that the equivalent of
having direct causal contact with a feature in a 3-D scene is not possible. That’s because the
only visual sensors we have are ones that respond to the 2-D retinal projection of the scene (cf
Pylyshyn, 1984; Fodor & Pylyshyn, 1981), and the mapping from the 3-D object-features to the
2-D retinal features is not in general reversible. Nonetheless – and this is a central assumption
of the FINST model – we can do something very analogous to pointing. The present approach
posits a mechanism called a FINST, which allows one to accomplish, in a limited way,
something that is functionally similar to indexing a feature in a 3-D scene, much as a finger
allowed us to index such a feature in the tactile example discussed above (which is why this type
of index was originally called an “INSTantiation FINger”, abbreviated as “FINST”).
A FINST is, in fact, a reference (or index) to a particular feature or feature-cluster on the
retina. However, a FINST has the following additional important property: Because of the way
clusters are primitively computed, a FINST keeps pointing to the “same” feature cluster as the
cluster moves across the retina.2 If the retinal feature cluster identified in this way maintains a
reliable correlation (over time) with some particular feature of the distal scene, then the FINST
will succeed in pointing to that distal feature, independent of its location on the retina. Thus
distal features which are currently projected onto the retina can be indexed through the FINST
mechanism in a way that is transparent to their retinal location. In addition, FINSTs allow the
system to relate particular retinal features to parts of the symbolic representation of the scene
being constructed, by associating the FINST that is bound to a particular feature-token with the
corresponding part of the internal representation of the scene.
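The identity-maintenance policy described in footnote 2 — assign the same symbolic name to lists of contributing points from successive frames when a significant subset of the coordinates remains the same — can be sketched as follows. This is a minimal illustration, not an implementation of the model; the function name, the overlap threshold, and the naming scheme are all invented for the example.

```python
# Minimal sketch of the "list of contributing points" policy: a cluster in
# the new frame keeps the same symbolic name (FINST) as a cluster in the
# previous frame when a significant subset of its points is shared.

def track_finsts(prev, current, overlap=0.5):
    """prev: dict mapping FINST name -> set of (x, y) contributing points.
    current: list of point-sets aggregated from the new frame.
    Returns the new dict of FINST name -> point-set."""
    assigned = {}
    unused = dict(prev)
    next_id = len(prev)
    for cluster in current:
        # Find the old cluster sharing the largest fraction of points.
        best_name, best_frac = None, 0.0
        for name, points in unused.items():
            frac = len(cluster & points) / max(len(cluster), 1)
            if frac > best_frac:
                best_name, best_frac = name, frac
        if best_name is not None and best_frac >= overlap:
            assigned[best_name] = cluster        # same index sticks to it
            del unused[best_name]
        else:
            assigned["F%d" % next_id] = cluster  # new feature, new index
            next_id += 1
    return assigned
```

On this policy the index follows the moving cluster with no explicit encoding of where the cluster is: identity is carried entirely by the overlap of contributing points from frame to frame.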
Notice that a FINST is very different from an encoding of the position of a feature. The
FINST itself does not encode any properties of the feature in question; it merely makes it
possible to locate the feature in order to examine it further if needed. Like the “chunks” in short-
term memory (see Johnson, 1972; Newell, 1973) FINSTs are opaque to the properties of the
objects to which they refer. For example, although one FINST may have been activated by a
colour and another by a shape appearing in the visual field, both could be properties of the same
object. Yet one could not tell by examining the two FINSTs that they referred to two properties
located at the very same place (i.e., that they were distinct properties of the same visual element).
Conversely, one could not tell by examining two FINSTs that they referred to two different
places that had the same property. Unlike coordinates or other explicit codes, the only way to
tell whether two FINSTs refer to the same location or property is by accessing the features that
are indexed by the FINSTs in question and determining whether the locations or properties (and
not the FINSTs themselves) were identical. Thus there is a fundamental difference between
FINSTs and various possible encodings associated with features (e.g. encodings of their
properties or their locations in some coordinate system).
One of the main purposes of FINSTs is to allow higher cognitive processes to refer to
specific visual features in evaluating certain spatial-relation predicates that apply to these
features (some additional uses for FINSTs, which will be discussed later, include allowing a link to be established between high-level descriptions of a scene and particular places3 in that scene,
———————
2. One might imagine, for example, a process operating in parallel over the retinal array and aggregating points that correspond to a putative edge or other scene feature. There are a number of simple possible mechanisms that can be used to make a FINST a “sticky” reference – i.e., to ensure that it keeps being attached to some particular feature-cluster independent of the retinal location of that feature. For example, the traditional way of representing an aggregated set of points in a computer vision system is by maintaining a “list of contributing points” for each cluster or aggregate. Whatever the method of aggregation, the “sticky reference” property of FINSTs could be accomplished by simply following the policy of assigning the same symbolic name to lists of contributing points from successive frames of view, if a significant subset of the coordinates on the list remains the same. We are currently also experimenting with several different, and perhaps more psychologically plausible, implementations of this identity-maintenance property of FINSTs. One is a network implementation, similar to the one that Koch & Ullman (1984) developed for modelling selective attention. The other is in the spirit of “token matching” approaches, using a predictive filter technique similar to that developed by Wu and Caelli (in press) for object tracking. These are the only portions of the FINST model that we have attempted to implement so far, which deal with real digitized images (although see footnote 9).
3. We sometimes speak of FINSTs as indexing places in a scene, in order to emphasize that it is feature-location rather than feature-type that is being indexed. However, it should be kept in mind that the theory only provides for filled places to be indexed in this way, not places in a totally empty region of the visual field.
and allowing motor commands to refer to the locations of these features in order to direct limbs
or eye movements to them). Being able to index particular features is particularly important
when encoding relational properties involving several places. For example, the assumption is
made that in order for the cognitive system to encode a relational property holding among several
places – such as COLLINEAR(x,y,z) or INSIDE(u,v) or PARALLEL(m:line, n:line) – the
arguments to these predicates must first be bound to features or places in the scene, i.e. FINSTs
must be assigned to the locations of the relevant features. Once assigned, groups of features or
“chunks” may also be formed, and under certain conditions a FINST may be assigned to the
entire chunk. A chunk which has a FINST bound to it may or may not also have FINSTs bound
to its component parts. However, in order to evaluate an n-place predicate, such as PART-
OF(x:element, c:chunk), all its arguments have to be bound to FINSTs.
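The binding requirement can be sketched in a toy formulation (the model itself does not specify how predicates access locations; `collinear`, `bind`, and the closure-based indexes below are illustrative stand-ins, not part of the theory): a predicate such as COLLINEAR is evaluated only over arguments bound to indexes, and the locations it consults are read through the indexes at evaluation time, never stored in them.

```python
# Toy sketch: a spatial-relation predicate evaluated over index-bound
# arguments. Locations are dereferenced at evaluation time.

def collinear(*indexes):
    # Model precondition: every argument is bound to an index.
    pts = [deref() for deref in indexes]
    if len(pts) < 3:
        return True
    (x0, y0), (x1, y1) = pts[0], pts[1]
    # Cross-product test against the line through the first two points.
    return all((x1 - x0) * (y - y0) == (y1 - y0) * (x - x0)
               for (x, y) in pts[2:])

def bind(scene, key):
    """Return a FINST-like index: dereferencing it consults the scene."""
    return lambda: scene[key]

scene = {"a": (0, 0), "b": (1, 1), "c": (2, 2)}
fa, fb, fc = bind(scene, "a"), bind(scene, "b"), bind(scene, "c")
assert collinear(fa, fb, fc)
scene["c"] = (2, 5)               # the indexed feature moves...
assert not collinear(fa, fb, fc)  # ...and the same bindings track it
```

Because the predicate consults the scene through its bound indexes rather than through stored coordinates, the evaluation remains correct for moving elements — the difficulty raised earlier for coordinate-based encodings.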
The question of how FINSTs are assigned in the first instance remains open, although it
seems reasonable that they are assigned primarily in a stimulus-driven manner, perhaps by the
activation of locally-distinct properties of the stimulus – particularly by new features entering
the visual field. Indeed, there is evidence (Burkell & Pylyshyn, 1988) that some transients, such
as luminance changes, and not others, such as isoluminant colour changes, do attract FINSTs. In
addition, under certain conditions top-down processes may also play a role in specifying which
of the potential active features get assigned a FINST.
Because of their pivotal role in enabling relational encoding to take place, FINSTs occupy a
critical place in visual processing. This makes it particularly tempting to view them as
representing a resource-constraint bottleneck, similar in spirit to the hypothesized limit on the
number of chunks that may be held in short-term memory, or even closer in spirit to Newell’s
(1980) assumption of a cost associated with each variable that gets bound in the matching of
conditions in a production system. This assumption is indeed part of the provisional picture of
the FINST mechanism. Notice that with this assumption, if both a chunk and its parts are
indexed (as would be required in order to determine whether a certain feature is PART-OF
another) this requires more resources than if the chunk alone (or the parts alone) are indexed, as
seems reasonable on intuitive grounds.
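The resource-constraint assumption might be caricatured as a small fixed pool of indexes. The sketch below is an illustration only: the class name is invented and the pool size is arbitrary, chosen merely to show that indexing a chunk together with all its parts can exhaust the pool where indexing the chunk alone does not.

```python
# Caricature of the resource limit: a fixed pool of indexes; an assignment
# fails once the pool is exhausted.

class FinstPool:
    def __init__(self, size=4):          # pool size is illustrative only
        self.size = size
        self.bound = {}                  # label -> indexed target

    def assign(self, label, target):
        """Bind an index to a target; return False if no index is free."""
        if label in self.bound:
            return True
        if len(self.bound) >= self.size:
            return False                 # no free index: binding fails
        self.bound[label] = target
        return True

    def release(self, label):
        self.bound.pop(label, None)

pool = FinstPool(size=4)
# Indexing a chunk alone costs one index...
assert pool.assign("chunk", ["p1", "p2", "p3", "p4"])
# ...but indexing the chunk *and* all four parts exceeds the pool.
results = [pool.assign(p, p) for p in ["p1", "p2", "p3", "p4"]]
assert results == [True, True, True, False]
```

The failure on the fourth part is the intended analogue of the intuition in the text: evaluating PART-OF over a chunk and its parts demands more index resources than either alone.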
As will be apparent, many of the assumptions surrounding the FINST idea are highly
provisional. Many questions concerning the properties of the FINST mechanism await further
empirical exploration. Nonetheless, there are already numerous consequences (to be discussed
below) of those of the present assumptions that seem most secure. The assumptions discussed
above are summarized in Table 1a for reference.
==============================
Insert Table 1 about here
==============================
Before discussing some empirical studies dealing with the role of FINSTs in visual attention,
another general argument for the need for such indexes will be sketched. This argument centres
around the problem of how we attain a stable representation of perceptual space. This, in turn,
leads to some additional assumptions concerning the role of FINSTs and other (non-visual)
indexes in motor control.
Constructing a spatially stable representation
Although the input to the visual system consists of continually moving images on the retina,
we nonetheless perceive a world that remains stable with respect to a global frame of reference.
This suggests that while there is an early stage at which the visual system operates upon a
retinotopic representation, there must be a later stage at which locations of perceived scene
features are encoded in relation to a frame of reference that is fixed in space (or at least a 2-D
projection of such a coordinate system).
It has been fairly traditional to assume that the geostability of visual perception means that
there is a representation which consists of a global image of the scene, fixed in distal (or world)
coordinates. The usual idea is that people construct and update such a geostable image of a
s c e n e b y “ p a i n t i n g ” t h e r e t i n o t o p i c r e p r e s e n t a t i o n o n t o a n e x t e n d e d 2 - D i m a g e o f t h e
environment, and that this image depicts a scene which is fixed within a geostable frame of
reference.4 In such models, the effect of eye movements is typically neutralized by locating the
point at which the retinal information is transferred to the extended image so that it is in exact
correspondence with the direction of gaze. One version of this view, called the “corollary
discharge” theory, claims that an “efference copy” of the signal going to the eye muscles is also
sent to the mechanism that superimposes the retinal image on the extended internal geostable
image. The fact that we can integrate information from different glances has typically been
taken as strong support for the view that there is a global stable image at some stage in visual
———————
4. For example, Feldman’s (1985) model of spatial perception posits a stable “feature frame” representation, onto which the retinotopic representation is mapped. Although Feldman’s feature frame is a global representation, it differs considerably from the simple “global-image” views: it is an active parameter-space representation, not a matrix corresponding to the 2D projection of the world into which the retinotopic information is deposited. Feldman uses “value units” to induce a mapping between the two frames (much as is done with the Hough transform mapping from images to parameter spaces; see Ballard, 1986). This differs from the present approach, which does not map the entire retinotopic representation onto some global space at all, but only provides indexes to selected FINSTed features, and cross-bindings to a descriptive symbolic representation.
processing. Such a view is widely accepted, even though the details of the registration process
are far from settled. (The facts concerning the relation between visual stability, gaze, and
various sources of information (such as motor efferents) appear to be open to question (see, for
example, Stevens, 1978; Steinbach, 1988; Miles & Kawano, 1987). Even the critical role of eye
movements is questionable since Hochberg (1968) has shown that under certain conditions
“glances” presented passively over the same retinal position can be perceptually integrated.)
One might ask whether the assumption of a global geostable image is necessary in order to
account for the phenomena of geostability, or whether the relevant phenomena might be
compatible with a simpler mechanism. To answer this question we must first be clear about the
empirical considerations that need to be addressed by such a mechanism. The basic one,
mentioned earlier, is that the world does not appear to move as our eyes move. Although
reliance on such phenomenology is generally considered problematic (for example, there is
evidence that people can respond to movements in the perceptual world of which they are not
consciously aware – see, e.g., Goodale, Pelisson & Prablanc, 1986), there is no doubt that the
phenomenal experience of stability is an important reason for postulating a geostable
representation.
Another relevant consideration is the observation that certain perceptual phenomena appear
to depend on scene coordinates rather than retinal coordinates. For example, there is evidence
that apparent motion is sensitive to scene coordinates (Rock and Ebenholtz, 1962), and that the
“correspondence problem” (Ullman, 1979) may be solved in scene coordinates as well; although
in some of these cases it is an open question whether these processes operate over a 2-D or over
a 3-D representation (see Wright, Dawson & Pylyshyn, 1988). This too has suggested to people
that there is a stage in visual perception where the information is encoded as a global geostable
image.
Finally, another straightforward, and from our perspective even more important, aspect of
geostability is the fact that perceived space connects with the motor system in a stable and
globally consistent manner. If we point to some object we perceive, the direction we point is
independent of where the projection of that object falls on our retina: it depends on where the
object is in scene coordinates.
What capacities must a system possess in order to be able to exhibit these phenomena, which
are characteristic of geostability? In order for a system to exhibit spatial stability, the following,
at least, should be true. First, under certain conditions of movement of features on the retina, the
system must be able to identify the sequence of features that correspond to the same place in the
scene. Second, the system must have some way to refer to the location of features that are not on
the retina (i.e. recalled features) in order to detect patterns that extend beyond the range of the
retina itself. Third, the system must have some way to coordinate movements (whether eye
movements or pointing) with the locations of both retinal and recalled (non-retinal) features. The
first two requirements recognize the need for some sort of coordination between retinal features
and off-retina features which allows one to identify sequences of proximal features as arising
from the same distal feature, even if the sequence is discontinuous and interrupted (i.e. as the
proximal feature moves off the retina and back again in the course of eye movements). The third
requirement recognizes that part of geostability concerns the cross-binding of locations in
perceptual and motor reference frames.
Although how these three requirements are met by the nervous system is far from clear,
there is at least reason to doubt that the task requires the piecewise “painting” of an extended
internal image. Indeed, the task does not even appear to require that locations of features be
explicitly encoded (say, in terms of their Cartesian coordinates), only that some means be
available for indexing the features so that they can be addressed by primitive perceptual and
motor operations. Consider how the FINST mechanism might provide a way to meet the first
requirement.5 To get an idea of how this might work, recall the “pointing fingers” analogy
discussed in the previous section. The use of mechanically-linked tactile sensors obviated the
need for an explicit encoding of global locations. One did not need such a global image in that
case because all the information needed for evaluating spatial-relation predicates remained in the
scene and could be accessed as required by using the tactile-links as indexes.
By hypothesis, FINSTs provide a precisely analogous way of indexing a number of feature-
places in a scene independent of their retinal locations. This, in turn, provides the basis for
achieving some of the effects that can be derived from an extended geostable image. For
example, if FINSTs provide the reference points for determining the relative perceived locations
of features, then the fact that FINSTs remain bound to distal features as the eyes move about6means that their relative perceived locations will remain invariant with eye movements. . This is
exactly what happened in the case of the tactile example discussed earlier, where fingers were
used as indexes to distal features. The approximate transparency of reference to scene features
———————
5. The following discussion should not be read as suggesting that the ability to index features using the FINST mechanism explains how humans achieve visual stability. Indeed, it seems quite likely that in the human visual system a variety of mechanisms take part in achieving this sort of stability – including monitoring both efferent and afferent signals from several sources, as well as monitoring a variety of dynamic visual patterns, such as optic flow. The point of this discussion is simply to suggest that FINSTs may be sufficient for the task, and therefore to argue that an extended internal image is not entailed by the facts of visual stability or the stability of visual-motor orientation.
6. Without the benefit of perceptually distinct features in a visual scene, to which perception can anchor stable referents, it is very difficult even to achieve visual stability. Thus vision in the Ganzfeld (or structureless visual field) is unstable, and people lose the sense of where their eyes are pointing or where they had previously been pointing. Indeed, motion and form perception are both seriously affected after 90 seconds of Ganzfeld exposure (Avant, 1965).
that the FINST mechanism makes possible, means that as long as the relative locations of
indexed objects remain fixed in the scene, their perceived relative spatial locations will not
change even though their retinal locations are changing.
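This invariance can be made concrete with a small sketch. It deliberately simplifies retinal position to scene position minus gaze direction, and the names are invented for the example; the point is only that a relation computed through indexes that stay bound to distal features does not change when the eyes, and hence all retinal coordinates, move.

```python
# Sketch: relative-location judgments made through scene-bound indexes are
# invariant under eye movements, even though retinal coordinates are not.

scene = {"star": (10, 5), "moon": (4, 5)}   # distal features, fixed in scene
gaze = [0, 0]                               # current gaze direction

def retinal(name):
    # Simplification: retinal position = scene position relative to gaze.
    x, y = scene[name]
    return (x - gaze[0], y - gaze[1])

def left_of(index_a, index_b):
    # Evaluated through the indexes: both features are re-located on the
    # current retina, so the *relative* judgment is unaffected by gaze.
    return retinal(index_a)[0] < retinal(index_b)[0]

before = left_of("moon", "star")
gaze[0], gaze[1] = 7, 3                     # the eyes move...
after = left_of("moon", "star")
assert before and after                     # ..."moon left of star" holds,
assert retinal("star") != (10, 5)           # though retinal positions changed
```

As in the tactile analogy, nothing here encodes a global geostable image: stability of the judgment falls out of the stability of the bindings.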
The second requirement on a system that can exhibit geostability properties, mentioned
above, was that it could integrate retinal information with information that is no longer on the
retina (but which might have been part of a previous glance). The “extended internal image” idea
is intended, in part, to allow the perceptual integration of these two types of information by
providing a representation that contained both types of features which then could be examined by
some subsequent mechanism (the “mind’s eye”). To see whether such a representation is needed
for this purpose, we need to consider the nature of the stored off-retinal information and the type
of integration that is possible.
Only a little is known concerning this question. For example, it is known that off-retinal
information is encoded in a sufficiently partial or abstract form that it does not enter into
perceptual processes in precisely the same way as retinal information. Off-retinal information
does not combine with retinal information to produce certain perceptual phenomena that occur
when all the information is retinal. For example, impossible figures (such as the “devil’s
pitchfork”) are not easily detected if the distances between inconsistent portions are large, or if the
information is presented in “glances” in some arbitrary order. Similarly, the automatic
interpretation of certain line drawings as depicting three dimensional objects does not occur as
readily if the parts of the figure are far apart or are not presented in an appropriate order or if the
segmentation of the scene into glances fails to present entire critical features in individual
glances (Hochberg, 1968). Such results suggest that the off-retinal information is not ‘visual’ in
the same way that retinal information is, but rather is abstract and conceptual – much like the
information in a mental image (Pylyshyn, 1981). In general it is easy to overestimate the amount
of visually reinterpretable information available at off-retinal locations. Indeed, if one could
index off-retinal locations in some way (as will be hypothesized below) then a simple label
attached to such an index (e.g. “concave edge,” “convex edge,” “outside boundary,” etc.) might
provide all the information needed to account for such things as the anorthoscope or eye-of-the-needle effect7 (see Rock, 1981). This conclusion is suggested by Hochberg’s (1968) finding that
perception of forms presented in a sequence of preprogrammed “glances” only occurs if the
order of the glances is one that enables the identity of individual contours to be tracked (e.g. in
———————
7. In an unpublished study, Ian Howard showed that when an image is moved behind a slit at medium speeds in the anorthoscope (i.e. at speeds slow enough to avoid self-masking), the ability to recognize the pattern depends on the memory requirements of the task. If the image seen through the slit has many contours that must be followed, the task is more difficult than if there are few contours, even though the image may be of exactly the same geometric form in the two cases, except for orientation (e.g. one might consist of a form such as “E” while the other consists of the same form rotated by 90 degrees: the former requires that three contours be tracked as the figure moves horizontally behind a vertical slit, while the latter requires only one). At sufficiently slow speeds the advantage of the fewer-contour version disappears.
presenting a rectangle one would have to present the sides in an order which preserves their
connectivity – in either a clockwise or counterclockwise cycle).
Thus it appears that the facts of perceptual integration may not require anything as
extravagant as a global image. They do, however, require something more than has been
assumed in the FINST hypothesis so far: a mechanism for evaluating relational
predicates involving both retinal and nonretinal places. In addition, we need a mechanism to
deal with the third requirement listed above; a way to relate the locations of perceived features to
motor commands. We shall return to both these issues in the next section.
Indexing for motor commands: Binding features to ANCHORs
In order to extend the usefulness of the FINST mechanism beyond the case where all
indexed information remains on the retina, it is necessary to address such additional questions
as: (1) How does the system maintain the identity of a feature cluster when the cluster disappears
off the retina and later reappears? and (2) How, in general, does the system compute the spatial
relation among features that are not on the retina concurrently? These are difficult problems
because it is clear that their solution depends on proprioceptive as well as visual information, and
also because they involve memory. While it is not known how the human visual system
manages to achieve the skills referred to above, the present approach has been to ask first for
sufficient conditions for it to be possible.
Clearly we can represent proprioceptive information and we can issue motor commands that
result in our eyes or limbs moving to desired locations. Let us put aside, for the moment, the
question of how this is done. Let us assume that the ability to issue a certain limited set of motor
commands, which cause a limb or eye to move to selected sensed locations, is part of our
primitive perceptual-motor capacity. Whatever the mechanisms by which these things are
accomplished, they are sure to be quite different from those with which we are currently familiar
– such as those being used in the design of industrial robot arms.
The strategy adopted in developing the present highly provisional and speculative ideas
concerning some problems of perceptual-motor coordination might be called a minimal-mechanism strategy.[8] In understanding how a cognitive or perceptual-motor function could be
———————
8. The term “minimal” is used here in an informal sense to suggest that the mechanisms appear to embody the smallest set of assumptions necessary for accomplishing the task – though there is no proof that no “simpler” mechanism is possible, and indeed the very notion of simplicity used here is not made explicit. The mechanisms are minimal in the sense that a Turing Machine is a minimal mechanism for computing: it is very elementary, yet sufficient for the task.
accomplished (how it is possible), one approach is to attempt first to discern the nature of what
has been called the ‘task demands’. There are a number of ways to approach this goal. One
way, championed by Marr (1982), is to attempt to develop a ‘theory of the computation’: an
abstract theory of the input-output function computed by the system which relates the function to
a goal (what, in the natural life of the organism, the function is for) and specifies some conditions
under which the goal can, in principle, be satisfied. This strategy has been extremely successful
in guiding research towards the discovery of a variety of ‘natural constraints’ among visual
properties.
There are, however, other heuristic strategies for approaching the difficult problem of
understanding the nature of ‘task demands’. An alternative strategy, which is the one adopted
here, is to take a small set of simple capacities that people appear to possess and see whether the
assumption that these are primitive operations in the human organism allows one to develop a
model that is sufficient for the task at hand, and which also accounts for certain otherwise
puzzling phenomena. This sort of minimalist top-down strategy has occasionally been used to
advantage in designing computational models. Good examples are Newell’s (1973) production
system architecture, and Marr and Nishihara’s (1976) SPASAR mechanism for rotating 3-D
models into a canonical orientation in the process of recognition. Inasmuch as it is also an
attempt to work out a set of basic operations which can be used to create a procedure for carrying
out the task, it is very similar in spirit to Ullman’s (1984) hypothesis of a set of basic operations
which form ‘visual routines’ for detecting spatial relationships in visual stimuli.
The simple operations that were initially assumed to be primitive are those that assign
FINSTs (which have already been discussed), together with a ‘MOVE’ operation which is
capable of causing certain objects to move to specified locations. As in the basic idea behind the
FINST hypothesis (wherein only FINSTed objects can serve as arguments to visual predicates),
it is assumed that only places that are indexed in the appropriate way can serve as arguments to
the MOVE command. The index that fills the role for the motor system, corresponding to the
FINST index in the visual system, is called an ANCHOR. Thus an ANCHOR is like a FINST,
except it indexes a place in motor-command space (and perhaps in proprioceptive space). One
might think of it as a reference to a place whose position can be accessed by the motor system in
just the way that FINSTed places (see footnote 3) can be accessed by the visual system.
In the simplest version of this speculative model, the only movable objects that have been
postulated are the centre of the visual field (think of this as a pair of cross-hairs) called the
‘FOVEA’, and another object (think of this as the end of a limb) called the ‘POINTER’. The
reason for beginning with such a simple and restrictive set of objects is that this provides a way
to explore the question of what additional assumptions are needed in order to be able to
command these movable objects to move to places that are seen (i.e. are currently on the retina)
as well as to places that were previously seen but must now be recalled from memory. In other
words, one is asking what operations appear to be demanded by the nature of the task being examined.[9]
Since only ANCHORed objects can appear as arguments in the MOVE command, the two
movable objects are assumed to be automatically bound to an ANCHOR. In order to be able to
MOVE these objects to both seen and unseen places, a new operation is required that can cross-
bind FINSTs and ANCHORs. This additional operation, designated BIND(x:FINST,
y:ANCHOR), is what makes it possible not only to coordinate between modalities, but also
allows features that were detected visually (by being FINSTed) to be later referred to by the
motor system – even after they are no longer visible. This is done by first cross-binding the
relevant FINST to an ANCHOR, and then issuing a command to move one of the movable
objects to the location of that (currently invisible) feature. Since one of the objects that can be
moved is the FOVEA, this allows the eye to be moved back to an object that has left the visual field.[10] Inasmuch as the number of both FINSTs and ANCHORs is limited, this process can only
be carried out in a restricted way.
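The binding discipline just described – only ANCHORed objects may serve as MOVE arguments, with BIND(x:FINST, y:ANCHOR) cross-linking the two kinds of index – can be sketched in a few lines. The following is a toy illustration only; the class and function names are mine, not part of the original model:

```python
# Toy sketch (illustrative names, not the author's implementation) of the
# FINST/ANCHOR scheme: FINSTs index visual feature clusters, ANCHORs index
# places in motor-command space, and BIND cross-links the two so that the
# motor system can be directed to a feature that is no longer visible.

class Finst:
    """An index bound to a visual feature cluster; no location code is stored."""
    def __init__(self, feature_id):
        self.feature_id = feature_id

class Anchor:
    """An index in motor-command (and perhaps proprioceptive) space."""
    def __init__(self):
        self.finst = None

def bind(finst, anchor):
    """BIND(x:FINST, y:ANCHOR): cross-bind a visual index to a motor index."""
    anchor.finst = finst
    return anchor

class MotorSystem:
    """Only two movable objects are postulated: the FOVEA and the POINTER."""
    def __init__(self):
        self.position_of = {"FOVEA": None, "POINTER": None}

    def move(self, obj_name, anchor):
        """MOVE accepts only ANCHORed targets; returns the feature reached."""
        if anchor.finst is None:
            raise ValueError("MOVE requires an ANCHOR cross-bound to a FINST")
        self.position_of[obj_name] = anchor
        return anchor.finst.feature_id

# Example: re-fixate a feature that has since left the visual field.
finst = Finst("corner-3")            # assigned while the feature was visible
anchor = bind(finst, Anchor())       # cross-bound before it disappeared
eye = MotorSystem()
reached = eye.move("FOVEA", anchor)  # drive the fovea back to that feature
```

Note that, as in the model, no coordinate code is stored with the FINST itself; the index merely provides access to the feature it points at.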
The only other assumption that is needed to account for the limited ability being explored is
the assumption that certain relational visual predicates, like LEFT-OF or ABOVE, can apply to
sets of features not all of which need be visible, so long as each is bound to either a FINST or to
an ANCHOR. These additional assumptions of the model are summarized in Table 1b.
A summary of the way that FINSTs function in indexing visual features, and providing a
way to cross-reference them to both the evolving internal description and the motor system, is
shown diagrammatically in Figure 1.
———————
9. The original problem that was investigated (described in Pylyshyn, Elcock, Marmor and Sander, 1978a) was to determine a minimal set of assumptions that were necessary in order for a system to be able to draw simple diagrams from a description, and to reason about them in a simple way – e.g. to discover ‘new’ properties that emerged as the diagram was being drawn. Such a simple system was, in fact, implemented in a Planner-like language called POPLER 1.5 and is described in Pylyshyn, Elcock, Marmor, and Sander (1978b). The model only implemented a version of the mechanism which associates features of the diagram with an evolving symbolic description (the diagram was not physically drawn, but merely simulated, so that the actual vision component was not implemented). The primary purpose of this implementation was to examine empirically whether the ideas sketched herein could in fact serve as the basis of a working system: in particular, whether the minimal mechanism was sufficient for drawing and keeping track of figures that were larger than the retina. It would, of course, have been preferable to prove mathematically some results about the limits of a system based on these principles, but it was felt to be premature to undertake such an analysis, given the extremely provisional nature of the assumptions under investigation.
10. The assumption that only ANCHORed/FINSTed objects can be the targets of movements receives some support from single cell recording studies. Goldberg and Wurtz (1972), and Wurtz and Mohler (1976) have shown that at the level of the superior colliculus, the firing rate of cells whose receptive field coincides with the target of an eye movement increases, with the increase occurring well before the eye movement itself begins. This suggests that the activation of such cells may correspond to the binding of FINSTs to ANCHORs at these locations prior to the issuance of a motor command.
==============================
Insert Figure 1 about here
==============================
To summarize: FINSTs allow internal representations to refer to places in a visual scene that
have not yet been assigned unique descriptions. In addition, they allow multiple references to be
made simultaneously, and also allow the motor system to, in effect, issue commands to move a
limb to certain visually perceived locations. The capacity to make such indexical references in
vision has far-reaching implications. A few of the consequences of this primitive mechanism for
explaining various empirical phenomena will be discussed below.
An empirical demonstration of the FINST mechanism:
Tracking multiple independent targets
Perhaps the easiest way to illustrate the FINST hypothesis in a concrete manner is to
describe an experiment intended to be a fairly direct test of several of the basic assumptions
behind this notion.
Consider the following experiment (for more details, see Pylyshyn & Storm, in press).
Suppose subjects are shown a field of identical randomly arranged points and are required to
keep track of some subset of them (called the “targets”) – as they must if their task is to count
the targets, or to indicate when one of them flickers or moves. In such a task, subjects might
proceed by encoding the location of each of the targets with respect to either a local or global
frame of reference, thus making it possible to distinguish and keep track of each target by its
coordinates. The encoding of relative positions might be facilitated by noticing a pattern formed
by the points, thereby “chunking” the set into a single mnemonic pattern. What clearly would not
work in this situation is to remember visual characteristics of the target subset, since the targets
and non-targets are visually identical.
Now suppose the points are set into random independent motion, and the subject is required
to indicate (by pressing a button) whenever one of the target objects briefly changes its shape, or
to indicate (by pressing another button) whenever a non-target briefly changes its shape. In this
case the distinctiveness of each point cannot be attributed to its location, since this is continually
changing. Hence storing a code for the location of each point would not help to solve the
problem, unless the location code is updated sufficiently frequently. The update frequency
would have to be such that during the time between updates the target remained within a small
region where it would not be confused with some nearby non-target. If location codes have to be
assigned in series by moving attention to each in turn (as most people believe), this would entail
sampling and encoding locations according to some sampling schedule in which points are
scanned in sequence. If one had some idea of the maximum rate at which points could be visited
and their locations encoded, it might be possible to design a display sequence that would cause
this strategy to fail – say because the points would have moved far enough during the sample
interval that there was a high probability that another point was now in the place occupied earlier
by the point whose location code one was attempting to update. Under such conditions, subjects
should no longer be able to do the multiple-tracking task described above.
Such an experiment was in fact carried out, and is summarized below. The
following, however, was the conclusion: Using some widely accepted assumptions concerning
the location encoding process, it was found that subjects could do very much better at this task
than predicted by the sequential encoding procedure. What, then, is a possible mechanism for
carrying out this task? If the assumptions and analysis of the experimental situation are correct,
it appears that subjects are able to simultaneously keep track of at least 4, or perhaps even 5 or
more distinct features in the visual field, without encoding their location relative to a global
frame of reference (e.g., without using some explicit symbolic location code). This is precisely
what the FINST hypothesis claims: it says that there is a primitive referencing mechanism for
pointing to certain kinds of features, thereby maintaining their distinctive identity without either
recognizing them (in the sense of categorizing them), or explicitly encoding their locations.
Now consider the details of the experiment. Based on some preliminary studies it was
determined that subjects could track at least 4 randomly-moving points (in the shape of “+”
signs) in a total field of 8 such randomly-moving points, and could detect whether a probe (a
square flashed for 83 msec) occurred on a target, a non-target, or at some other location. In order
to design the task in such a way as to preclude its solution by a sequential-sampling procedure,
appeal was made to the generally held view that in order to encode the location of a point, a
subject must attend to that point. As Anne Treisman and others have shown (e.g. Treisman &
Gelade, 1980), noticing that a stimulus contains a certain feature is not the same as noticing
where that feature is: the two can be functionally dissociated. In order for the information about
location to be available for such purposes as identifying where the point is in relation to some
frame of reference or some other feature, it seems that the feature has to be attended to.
Furthermore, it is widely believed (see the scanning velocity references listed below) that this
sort of attention is unitary – i.e. there is only one attention locus which must be moved from
place to place. Attending, according to this view, entails actually moving a locus of focused
attention (without necessarily moving the eye) to that location. A substantial number of studies
now exist which conclude that a single locus of processing must be moved about in the visual
field and that the movement is continuous (although there are some investigators who disagree
with one or another of the single-locus or the continuous-movement assumptions; see below for
references). Since the FINST hypothesis represents an alternative way in which “attention” may,
in effect, get from one location to another (viz, the system might access a feature through one
index and then access another feature through a second index, and thus not have to scan across
the intervening space), it must be shown that a procedure based on serial scanning could not
account for observed results.
The velocity with which attention appears to move within the visual field has been estimated
by various researchers, using quite different techniques, to range from 30 to 250 degrees per
second (i.e. from 33 to 4 msec/degree). For example, Eriksen & Schultz (1977) provide the
slowest measure of scanning velocity, 30.3 deg/sec; Jolicoeur, Ullman & Mackay (1985) found
contour-following to proceed at 38.5 to 41.7 deg/sec; Shulman, Remington & McLean (1979)
give 52.6 deg/sec for visual scanning; Tsal’s (1983) more direct measurement yields 117.6
deg/sec and Posner, Nissen & Ogden (1978) provide the fastest figure of 250 deg/sec. Many of
these estimates have been questioned; indeed, there has been some criticism of the general
methodology which led many people to conclude that attention must move continuously through
intermediate positions (Remington & Pierce, 1984; Eriksen & Murphy, 1987; Yantis, 1988).
However, if one accepts the widely-held view that attention is unitary and moves continuously,
then 250 degrees/sec (or 4 msec/degree) would certainly appear to be an upper bound on the
speed with which it can move.
Now if the minimum path length required to scan all 4 points being tracked is known, the
dispersion of the points and their velocity can be set so as to ensure that the scan-and-encode
method will frequently mistake a distractor for a target. This was done by a combination of
making the mean speed of movement of the points sufficiently high (8 degrees/second), the mean path length sufficiently long (about 34°), the predictability of the location of a point from its
current velocity and direction sufficiently low (by changing object velocity and direction often),
and the total tracking time sufficiently high (about 4 seconds), and by ensuring that a target is
never more than 1.5 degrees from a distractor.
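The back-of-envelope arithmetic behind this design can be made explicit. The following is my own illustration, using only the parameter values quoted in the text (4 msec/degree scanning cost, a 34° mean scan path, 8 deg/sec target speed, and a 1.5° confusion radius):

```python
# Back-of-envelope arithmetic (my own illustration, using the parameter
# values quoted in the text) showing why serial scan-and-encode should fail.

ms_per_deg       = 4.0    # fastest reported attention velocity: 250 deg/sec
scan_path_deg    = 34.0   # mean shortest path covering all 4 targets
target_speed     = 8.0    # mean target speed, deg/sec
confusion_radius = 1.5    # a distractor may be this close to a target, deg

cycle_ms  = scan_path_deg * ms_per_deg        # one full scan-encode cycle: 136 ms
drift_deg = target_speed * cycle_ms / 1000.0  # drift between visits: ~1.09 deg

# The drift between successive visits to a target is already most of the
# way to the nearest possible distractor (1.09 of 1.5 deg), and frequent
# direction changes make the stored location a poor predictor, so
# nearest-neighbour reacquisition must often grab a distractor instead.
```

Even granting the most generous scanning velocity in the literature, a target moves roughly a degree between successive visits, which is why the serial strategy predicts frequent tracking errors on these displays.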
The determination of the probability of erroneously switching to tracking a distractor prior
to the time of the probe was done by actually simulating a sequential scan of the very stimuli
used in the experiment, and having the simulated process pick the object nearest the encoded
location at each sampling cycle. Time and distance parameters used in the simulation were
based on measurements made on the actual displays used in the experiment. The sequence of
displays was examined and the shortest path covering all 4 targets was measured on each frame
(then averaged over the entire trial). This distance, together with different assumed values of
attention-scanning velocities, was used to obtain the appropriate intersample time. Several
different scanning strategies were simulated, including a complex strategy based on the
assumption that subjects detected the speed and direction of the sampled point and used this to
project, and store, the location at which the point was expected to be when next sampled. An
additional “sophisticated guessing strategy” was also simulated. This assumes that subjects can
reliably detect the occurrence of a probe event even when the event does not occur on an object
being tracked, and also assumes (rather unrealistically, though conservatively) that subjects can
discern when they have lost track of targets. In this case, subjects could guess whenever a
probe occurs on a “lost” trial by randomly selecting one of the three possible responses (i.e.
indicating whether the probe occurred on the target, nontarget, or neither).
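The core of the simulated scan – re-finding each target at every sampling cycle by taking whichever object now lies nearest the location stored on the previous visit – can be sketched as follows. This is a reconstruction for illustration only, not the code used by Pylyshyn & Storm; the function names and the linear-motion example are mine:

```python
import math

def simulate_serial_scan(positions_at, object_ids, target_ids,
                         intersample_s, duration_s):
    """Serial scan-and-encode tracker: each tracked 'slot' is re-found at
    every sampling cycle as whichever object now lies nearest the location
    stored on the previous visit.  Returns the fraction of slots still on
    a genuine target at the end of the trial."""
    followed = {tid: tid for tid in target_ids}   # object each slot believes is its target
    stored = {tid: positions_at(0.0)[tid] for tid in target_ids}
    t = 0.0
    while t < duration_s:
        t += intersample_s
        snapshot = positions_at(t)                # {obj_id: (x, y)} at time t
        for slot in target_ids:
            nearest = min(object_ids,
                          key=lambda oid: math.dist(snapshot[oid], stored[slot]))
            followed[slot] = nearest              # may silently switch to a distractor
            stored[slot] = snapshot[nearest]
    return sum(followed[s] in target_ids for s in target_ids) / len(target_ids)

# One moving target "A" passing near a stationary distractor "B":
def positions_at(t):
    return {"A": (2.0 * t, 0.0), "B": (1.0, 0.8)}

slow = simulate_serial_scan(positions_at, ["A", "B"], ["A"], 1.0, 1.0)   # long intersample
fast = simulate_serial_scan(positions_at, ["A", "B"], ["A"], 0.25, 1.0)  # short intersample
```

With the long intersample interval the slot is captured by the distractor (`slow` is 0.0); with frequent sampling the target is retained (`fast` is 1.0) – which is exactly the dependence on sampling rate that the experimental displays were designed to exploit.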
The predicted performance derived from these simulations, together with the observed
performance, are shown in Figure 2, plotted as a function of the velocity of attention scanning.
Since the upper limit for scanning velocity is taken (again rather conservatively) to be 4
msec/degree, the results show clearly that subjects are not sampling the points in a sequence of
move-encode operations. (The details of the experimental design and the serial scan model used
in the prediction are described in Pylyshyn & Storm, in press).
==============================
Insert Figure 2 about here
==============================
The conclusion, then, is that the 4 targets are being tracked in parallel, and that the tracking
is not based on encoding the locations of points with respect to some frame of reference, but
rather is based on a simple dynamically-maintained indexing scheme such as that proposed by the FINST hypothesis.[11]
———————
11. Since the above experiment was reported, a number of other studies have been carried out using different equipment (a Commodore Amiga computer) that made it possible to achieve smooth movement at faster speeds, to use up to 6 targets whose shapes were varied dynamically, and to construct trajectories that avoided collisions by simulating an inverse-square-law repulsion about each object (instead of making discrete direction changes just prior to a potential collision, as in the present study). It was found that in this more complex setup, experienced subjects were able to perform even better than those in the original experiment, due primarily to the relative ease of tracking smoothly accelerating motion.
Implications for Visual Routines
This section returns to a consideration of the relevance of FINSTs for the computation of
spatial relations. Ullman (1984) has examined a number of spatial properties that the human
visual system can compute with apparent ease, and has asked how this might be done. In the
case of many spatial relations (e.g. “inside”) it is difficult to see how the relation could be
computed by a purely parallel process, without any sequential scanning of the display, since it
requires checking on the relation between the location of a point and an arbitrary curve. All the
possible algorithms that Ullman considers involve some serial process, such as “painting” a
region beginning either at the point in question or at places along the curve, or extending radial
lines from the point in question and noticing the parity of their crossings with the curve. In all
these cases Ullman concludes that “the execution of visual routines requires a capacity to
control locations at which elemental operations are applied” (Ullman, 1984, p. 135). The same
also appears to be true for the detection of a number of other visual properties, such as whether
two points lie on the same contour (e.g., Jolicoeur, 1988).
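For concreteness, one of the serial routines Ullman considers – extending a ray from the point and checking the parity of its crossings with the curve – looks like this for a closed contour. This is an illustrative sketch; representing the contour as a polygon is my assumption:

```python
def inside(point, polygon):
    """Ray-casting parity test: count crossings of a rightward horizontal
    ray from `point` with the polygon's edges; odd parity means 'inside'.
    `polygon` is a list of (x, y) vertices of a closed contour."""
    x, y = point
    crossings = 0
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through `point`?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:       # crossing lies to the right of the point
                crossings += 1
    return crossings % 2 == 1

square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
in_pt = inside((2.0, 2.0), square)   # point inside the contour
out_pt = inside((5.0, 2.0), square)  # point outside the contour
```

The point is not that the visual system runs this algorithm, but that any such routine is inherently serial and must repeatedly direct elementary operations to particular locations on the curve.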
Although it is clear that a capacity to control locations at which processing is carried out is
necessary, it does not follow that this must involve moving a unitary locus of attention to that
location, or encoding the location in some explicit way. Thus while there is a sense in which
Ullman may be correct in claiming that “The marking of a location for later reference requires a
coordinate system…with respect to which the location is defined”, there is no need to assume
that the location has to be defined by an explicit set of coordinate codes.
The FINST mechanism shows how one can mark a location in a manner that will
subsequently allow processing to be directed to it if, for example, some process needs to access
information indexed to that location, or even if a spatially local focus of attention has to be
directed to that location for some reason. FINSTs also provide a way for the location to be
referred to in certain primitive motor commands (since a FINST can be cross-bound to an
ANCHOR). Yet FINSTs do not themselves make an encoding of the location of features
available to high-level processes, such as ones that compute spatial relations between the place in
question and other places. They simply make it possible for appropriate processes to obtain
access to such information; they do this by indexing the feature in question in the actual display.
Encoding spatial information always requires additional processing mechanisms. Most
predicates, and virtually all of Ullman’s proposed visual routines, require that more than one
place be indexed prior to the computation proceeding. Furthermore, many of the studies on
perceptual attention scanning use peripheral cues to induce the movement of attention (e.g.
Posner, Nissen & Ogden, 1978). In these cases the cue itself would have to be located in order to
serve as a directional indicator, and this would have to be done prior to attention being shifted to
it. Thus it is clear that indexing is quite different from “attending” in the usual sense, where this
term is understood to mean a single spatial focus of processing.
Consider the following examples of relations requiring visual routines. Figure 3 shows
some stimuli used to illustrate tasks requiring visual routines (several of these were reported by
Ullman, 1984). In panel (a) the task is to decide whether point x (or x’) is inside the contour. In
panel (b) the task is to say whether points x and y (or x and y’) are on the same contour (as in the
studies reported by Jolicoeur, 1988). In panel (c) the task is to say whether there is a path from
the centre of the circle to the circle itself. In panel (d) the task is to say how many points there
are. In panel (e) the task is to say whether the 3 objects are collinear. Notice that in each case
the task cannot be done without indexing several visual objects. In some cases all the objects in
question are points. In others, such as (a) and (c), they include contours.
==============================
Insert Figure 3 about here
==============================
It is not known whether entire contours can be FINSTed, though there are some reasons for
thinking that at least simple contours, or short smooth segments of larger contours, can. For
example, Rock and Gutman (1981) have demonstrated that people can attend selectively to a
contour of one colour when it is intertwined with a similar contour of a different colour, as
shown by their inability to recognize the unattended contour as one they had seen before. In
other relevant experiments, Treisman & Kahneman (1983), and Kahneman, Treisman & Gibbs
(1983) showed that a letter presented briefly in a particular moving box primes recognition for
that letter with particular effectiveness when the letter recurs in the same box, even when the box
is in a new location. This suggests that subjects can track the movements of contours such as
boxes, and that they also use the identity of an object (such as a box in this case) to index other
associated properties. On the other hand, the fact that the difficulty in evaluating the “inside”
predicate depends to some extent on the size and shape of the bounding contour – at least when
the contour becomes sufficiently complex – suggests that FINSTing entire contours may not be
a simple primitive operation. The present (provisional) assumption is that some larger
aggregates can indeed be FINSTed. However, it may be that an entire contour such as that in
figure 3a requires several FINSTs to cover distinct segments of the curve, and that how
accurately a FINST localizes features depends on such factors as their distinctiveness and on
how many features compete for the pool of available FINSTs (recall that the FINST allocation
process is resource-limited).
In any case it is clear that some pre-attentive indexing must be going on. The assumption
that there are limits on the number of such indexes that can be simultaneously maintained also
seems plausible. The tracking experiment described above suggests that at least 4 (and possibly
as many as 5 or 6) FINSTs are possible – and this number is a lower bound estimate obtained in
a task designed to be particularly difficult. Subitizing (which requires that objects be marked
rapidly as they are counted) suggests about 4 FINSTs in that case (e.g. Klahr, 1973). It seems
likely that the amount of information that can be indexed in this way might be increased by
“chunking” patterns and then FINSTing the entire chunk (see, for example, Mahoney & Ullman,
1988), much as the amount of information held in short-term memory can be increased by
chunking. Clearly there remain many unanswered empirical questions concerning exactly what
kinds and how many features can be FINSTed, though the principle that a number of different
features can be indexed in something like the way assumed by the FINST hypothesis seems well
supported.
Implications for studies of mental imagery
One of the phenomena that led to the development of the FINST hypothesis in the first place was the
widespread assumption that in certain kinds of reasoning people construct and examine a
representation that has many of the properties of a picture (e.g., it has intrinsic metrical and
geometrical properties). This is typically what is meant in referring to a representation as an
“image”. The nature of this representation is assumed to be similar whether constructed from
retinotopic information in perception, or from long-term memory in the course of imagining.
The question of whether the facts of spatial stability of perception require such a representation
has already been raised. In this section, certain evidence will be examined which is frequently
taken to show the existence of such a representation, constructed in a mental workspace in the
course of reasoning.
Among the phenomena that have led some people to assume the existence of a spatially
extended object referred to as an “image” are such findings as the increased time it takes subjects
to report properties of imagined objects when they are instructed to imagine the objects as
‘smaller’, the increased time it takes to mentally scan longer distances in an image, as well as
certain other parallels between imagery and perception (such as motor adaptation to imagined
errors in pointing, which parallels adaptation to observed errors in pointing induced by displacing
prisms; Finke, 1979). Pylyshyn (1981) argues that at least the scanning phenomena, and perhaps
other similar phenomena as well, are due to the demands of the task, and in particular to subjects’
tacit knowledge of what would happen in the real situation being imagined, rather than to any
intrinsic properties of how images are represented. There is one case, however, that does not
appear to be subject to the task-demand criticism; this is where images are “projected” onto some
visual scene (e.g. Finke & Pinker, 1982). In this case the phenomena do not disappear when
instructions are changed appropriately. However, it appears that the scanning results in such
cases can be accounted for by the FINST hypothesis without the need to posit a spatially-
extended internal representation.
Before discussing how the FINST hypothesis can deal with these scanning results, consider
how the FINST idea might be relevant to projected-image tasks in general. A particularly simple
illustration of how the FINST hypothesis can deal with such phenomena is an experiment
described by Shepard and Podgorny (1978). In one version of this experiment (described in
Shepard, 1978), a subject inspects a grid on which a pattern, such as the capital letter “F”, is
outlined. A small spot appears briefly on the display, and the subject must press one of two
buttons; one if the spot occurs in a grid square within the letter, the other if it occurs in one of the
grid squares not inside the letter. Reaction time was found to vary systematically with the
location of the spot on the display: it is generally shorter when the spot is inside the letter and is
shortest when the square on which it occurs lies at the intersection of two or more letter-strokes
(i.e. at an “L” or a “T” vertex of the block letter). What was most interesting, however, is that
exactly the same pattern of results is found when the subject is asked to imagine the letter on the
grid, rather than being shown the actual letter on the screen.
The results of this and other similar studies (e.g. Hayes, 1973) have usually been taken as
evidence that there is a superposition of two stable extended “images”, of the sort discussed
earlier. However, it now appears that in these cases the results can be accounted for quite simply
by appealing to the same mechanism that was used earlier to explain certain phenomena of
geostability, which also do not appear to require an extended internal ‘image’. All one needs to
assume is that in both the perceptual and imaginal conditions, the subject prepares for the task by
placing FINSTs on selected features, or aggregates of perceptually-integral features, such as grid
squares, or even rows or columns of such squares that make up letter strokes. Since this
mechanism allows the subject to index actual places in the display (i.e. particular grid squares) –
whether or not there is actually something graphically distinct about those grid squares – the
task of deciding whether the probe appears on one of these indexed places is carried out visually
in both cases (footnote 12).
Moreover, the systematic pattern of reaction times in both visual and imaginal-visual cases
can be explained in exactly the same way. In neither case is an internal image required, only the
ability to index feature-clusters. For example, consider how the presence of FINSTs might be
used to explain why reaction time to a spot is shorter at a vertex than in the middle of a stroke.
One might develop a stochastic race model in which FINSTs to places that are indexed as figure-
strokes are followed in parallel, and a positive response made when the first such index is found
to lead to a region (assumed to be marked by texture elements or by a grid) which is also indexed
by a probe FINST. If strokes are independently indexed, then there are two paths to a vertex and
only one to the middle of a stroke; hence the time to verify the vertex location would be less than
the mid-stroke location. Whatever the merits of such a speculative model, notice that the same
explanation would hold for the visual case as for the “projected image” case.
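The race-model speculation above can be made concrete with a toy Monte-Carlo simulation. The exponential latency distribution, the parameter values, and all function names here are assumptions made purely for illustration, not part of the FINST model:

```python
import random

random.seed(0)

def verify_time(n_paths, mean_latency=100.0, trials=10000):
    """Estimate verification time when n_paths independently-FINSTed
    strokes race to confirm the probe's location.  Each path's latency
    is drawn from an exponential distribution; the response follows
    whichever path finishes first."""
    total = 0.0
    for _ in range(trials):
        total += min(random.expovariate(1.0 / mean_latency)
                     for _ in range(n_paths))
    return total / trials

mid_stroke = verify_time(n_paths=1)   # one stroke indexes this square
vertex     = verify_time(n_paths=2)   # two strokes converge on a vertex

print(mid_stroke, vertex)  # vertex is roughly half the mid-stroke latency
```

With two racers instead of one, the expected minimum latency drops, reproducing the faster responses at vertices without any internal image.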
The differences between this view and the conventional “image” position are substantial,
and have far-reaching implications for theories of cognitive processing. The standard imagery
view (e.g. as put forward by Kosslyn, Pinker, Smith & Shwartz, 1979, and others) hypothesizes a
representation that is physically realized in some analogue medium with certain intrinsic
Euclidean or metrical properties. This approach assumes that it is these analogue properties of
the medium itself that explain such things as, for example, the increased reaction time that occurs
with increased image-distance scanned (since, according to this view, there is a real physical
analogue to “distance” in the representation of the image – an analogue that obeys the physical
law distance = speed × time).
But note that in the case where imagined places are “projected” onto a visual scene, we do
not need to appeal to particular properties of an internal analogue medium in order to explain
certain psychophysical phenomena, such as those involving “scanning”. We need only appeal to
the relevant properties of the real scene, together with some plausible assumptions concerning
the perceptual process (for example, that perception can veridically encode certain spatial
relationships that hold among FINSTed elements in a scene). The FINST mechanism also makes
it possible to associate conceptual “labels” with such places, and thus could in principle enable
the perceptual-motor system to behave in certain (restricted) respects as though particular kinds
———————
12. The Shepard and Podgorny result can be obtained without using a grid (as in the Hayes study), although the version that uses grids is described above for simplicity of exposition. For present purposes, all that is required is that there be some visual features that can act as reference points for locating places where filled squares and/or target points occur. Surface texture elements are sufficient for this purpose. As has already been remarked, such texture elements are necessary for even the most rudimentary visual stability to occur (footnote 6).
of features actually were located at these places (footnote 13). For example, it could allow attention to be
“scanned” to such indexed places, with or without actual eye movements. If that were the case,
then the increased time taken when the scanned distances are greater would simply be the
consequence of a physical law, since real physical distances, not representations of distances,
were being traversed.
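As a sketch of this point: if attention is shifted between FINSTed places in a real scene at some fixed rate, the latency pattern follows from the geometry of the scene itself. The coordinates, the scan rate, and the function below are all hypothetical:

```python
import math

# Hypothetical FINSTed places in a real scene (screen coordinates, cm).
finsted = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (9.0, 12.0)}

SCAN_SPEED = 10.0  # cm per second; an assumed constant attention-shift rate

def scan_time(src, dst):
    """Time to shift attention between two indexed places.  Because these
    are actual scene locations, the latency follows directly from
    distance = speed * time; no internal analogue medium is required."""
    (x1, y1), (x2, y2) = finsted[src], finsted[dst]
    return math.hypot(x2 - x1, y2 - y1) / SCAN_SPEED

print(scan_time("A", "B"))  # 0.5 s for 5 cm
print(scan_time("A", "C"))  # 1.5 s for 15 cm: triple the distance, triple the time
```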
Forming visual descriptions
FINSTs may also play an important role in the process of encoding and recognizing a scene.
There is considerable evidence now that the encoding of visual information in memory is a
process of forming internal descriptions (see, for example, the discussion of this point in
Pylyshyn, 1973). There has been a great deal of exploratory research in both Artificial
Intelligence and in Cognitive Science on the nature of the basic component shapes (i.e., on the
vocabulary of the descriptions). Such primitive shape elements play an important role in an
approach to object recognition called ‘recognition by parts’ (e.g. Pentland, 1987; Hoffman &
Richards, 1985; Biederman, 1988). In this approach, an unknown object is described in terms of
the shape-class of its constituent parts, the transformations of these canonical shapes (e.g.
rotated, tapered, elongated, etc.), and the relations among them. As the description is being built,
its parts are looked up in memory for purposes of recognition. This process can be very complex
because in general the description is a hierarchical one, in which patterns of basic parts
themselves form higher level patterns.
Despite considerable progress in identifying basic shapes (Pentland, 1986), the encoding of
the hierarchy of relations among these basic shapes is much more difficult. The problem is that
in encoding this hierarchy one has to keep track of a great many things. First one has to keep
track of the pattern that forms each of the basic parts, in order to identify them. Then one has to
keep track of the next level of patterns among these patterns, and so on. Yet the human visual
system appears to be able to encode complex scenes and to identify them in about a tenth of a
second (e.g., Potter, 1975). The complexity of the representation that is built up can be
appreciated by considering the enormous difficulty people have in building a representation of
even a simple figure from verbal instructions. What appears to be so hard about the latter task is
the problem of retaining substructures while building the next levels of a hierarchy. It seems
———————
13. Of course, it remains an open empirical question just how perception-like the processing of information from such “bound” features can be. There have been claims that some illusions – such as the Muller-Lyer illusion – can be created by imagining arrowheads on lines (Bernbaum & Chung, 1981). Although the interpretation of such experiments is not unproblematic, they should not, in any case, be taken as supporting an “imagery” position: for one thing, there is much we don’t know about the locus of such illusions in the visual case.
plausible that one of the things that makes the task easier when the input is visual is that retention
of the substructures is aided by the continued presence of the corresponding part of the scene in
the visual field. This observation suggests a possible role for FINSTs in the encoding of
complex shapes and in the “recognition by parts” process.
FINSTs allow a system to simultaneously refer to several features of an existing pattern as
well as to aggregates (or chunks) of features, and also allow these features to be linked to
symbols in long-term memory. Such indexing and cross-binding of parts of a scene to symbol
structures can occur at many levels of a hierarchical description. This suggests that a description
could be built up level by level, by a process which keeps the working memory load down by
relying on the FINSTing of patterns in the display. An example of how such a process might
work is the following (the details here are highly speculative; the intent is simply to illustrate
how FINSTs might play a role in the process).
When a novel scene is initially presented, FINSTs are assigned to some initial set of
features. A cluster of such indexed features might then be recognized and chunked (perhaps
along the lines suggested by Mahoney and Ullman, 1988). Such a cluster would then be treated
as a single item: its description would be stored in LTM and its token occurrence in the scene
assigned a FINST, which would also be linked to the LTM description. This would free up the
FINSTs that had been bound to its subpart features, and would also allow the chunk as a whole
to be referred to (say in new relational predicates). This new reference capability is an important
step in structure-building in general, and follows a principle that Marr (1982) referred to as “the
principle of explicit naming”.
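The speculative cycle just described (bind, recognize, store, release, re-bind) might be sketched as follows. The class, variable, and token names are illustrative only, and the fixed pool size is an assumption of this sketch:

```python
import itertools

class FinstPool:
    """Toy pool of a small, fixed number of visual indexes (FINSTs)."""
    def __init__(self, size=5):
        self.free = size
        self.bindings = {}                # finst id -> scene token
        self._ids = itertools.count()     # fresh ids, never reused

    def bind(self, token):
        if self.free == 0:
            raise RuntimeError("no FINSTs available")
        self.free -= 1
        fid = next(self._ids)
        self.bindings[fid] = token
        return fid

    def release(self, fids):
        for fid in fids:
            del self.bindings[fid]
            self.free += 1

ltm = {}                                  # long-term store of descriptions
pool = FinstPool(size=5)

# Level 1: index four primitive feature tokens in the scene.
fids = [pool.bind(tok) for tok in ["bar1", "bar2", "bar3", "bar4"]]

# The cluster is recognized and chunked: its description goes into LTM,
# the subpart FINSTs are freed, and a single FINST is assigned to the
# chunk's token occurrence in the scene, linked to the stored description.
ltm["square"] = {"parts": ["bar1", "bar2", "bar3", "bar4"]}
pool.release(fids)
chunk_fid = pool.bind("square-token")
ltm["square"]["token"] = chunk_fid

print(pool.free)  # 4: the whole chunk now occupies one FINST instead of four
```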
What has just been described is an example of the familiar hierarchical chunking process,
frequently postulated in theories of learning and memory (e.g. Johnson, 1972). What is special
about the present proposal is the idea that FINSTs provide the placeholders that allow such
descriptions to be built level by level, with chunks at the current level being bound to token parts
of the scene, and their descriptions stored in long-term memory. Thus at any particular time the
system, in effect, has access to a hybrid entity consisting partly of a symbol structure and partly
of indexed objects in the scene. For example, in encoding the relation between two complex
subfigures which are resting on top of one another, the arguments of the ON-TOP-OF(x,y)
relation might be FINSTs bound to feature aggregates or chunks in the scene, thus obviating the
need to have the description of the substructure simultaneously present in working memory.
This hierarchical chunking process, with each successive level of the hierarchy being built by
reference to the visual display rather than to descriptions held entirely in working memory, uses
much less working memory. It does, however, assume some way to index parts of a figure and
to link them to structures in long-term memory. This is precisely what FINSTs are intended to
provide.
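The hybrid symbol-plus-index entity described above might be caricatured as follows. The coordinate representation of scene tokens, and all names here, are assumptions made for illustration, not claims of the model:

```python
# A relational predicate whose arguments are FINST indexes bound to scene
# chunks: working memory holds only the symbolic fact plus two indexes,
# not the chunks' full descriptions.

scene = {  # token chunks at their actual positions in the scene (x, y)
    "figureA": (4.0, 7.0),
    "figureB": (4.0, 2.0),
}

finst = {1: "figureA", 2: "figureB"}  # FINST index -> scene token

def on_top_of(fx, fy):
    """ON-TOP-OF(x, y): evaluated by consulting the indexed scene tokens
    themselves rather than descriptions held in working memory."""
    (_, ya), (_, yb) = scene[finst[fx]], scene[finst[fy]]
    return ya > yb

working_memory = [("ON-TOP-OF", 1, 2)]   # the hybrid entity
rel, x, y = working_memory[0]
print(on_top_of(x, y))  # True: figureA rests above figureB
```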
Summary and Conclusion
This paper has presented a number of examples illustrating the usefulness of assuming a
primitive mechanism capable of individuating and dynamically indexing a small number of
features (or feature-clusters) in a visual field. Such an assumption can help illuminate a number
of quite disparate empirical phenomena. It was argued that something very much like the FINST
binding mechanism is independently required for determining where visual operations (such as
those in “visual routines”) are to be applied. FINSTs represent the primary mechanism by which
variables in visual predicates and operations can be bound to particular places or elements in a
stimulus so that they can be evaluated with respect to particular feature-locations in a scene.
In addition to exploring these assumptions – and suggesting a number of others, such as
those involving cross-modality binding of visual and motor spaces – this paper has presented
some direct evidence bearing on one of the assumptions about properties of FINSTs. The
assumption in question is that FINSTs can pre-attentively track a number of independently
moving visually-identical objects under conditions where it is unlikely that the task is being done
by serial time-sharing.
The wide range of phenomena addressed by this simple, independently motivated postulate
makes it a promising basis for investigating the interface at which attention and higher cognitive
processes are brought to bear on the products of the earliest automatic and preattentive stages of
vision and of visual-motor coordination. Moreover, although this point is beyond the scope of
the present paper, there is also a need for a mechanism such as the FINST to deal with the
problem of assigning semantics to linguistic expressions containing spatial indexicals (like
“here” and “there”) – a problem that has occupied many people interested in semantics and its
relation to perception (see, for example, Peacocke, 1983).
References
Avant, L.L. (1965). Vision in the Ganzfeld. Psychological Bulletin, 64, 246-258.
Ballard, D.H. (1986). Cortical connections and parallel processing: Structure and function. The
Behavioral and Brain Sciences, 9, 67-120.
Biederman, I. (1988). Aspects and extensions of a theory of human image processing. In Z.W.
Pylyshyn (ed). Computational Processes in Human Vision: Interdisciplinary Perspectives.
Norwood, N.J.: Ablex Publishing.
Bernbaum, K., and Chung, C. S. (1981). Muller-Lyer Illusion Induced by Imagination, Journal
of Mental Imagery 5:125-128.
Burkell, J.A. and Pylyshyn, Z.W. (1988). Is colour change a primitive visual feature? Cognitive
Science Technical Report 34. Centre for Cognitive Science, University of Western Ontario,
London, Canada.
C o l e s , M . G . , G r a t t o n , G . , B a s h o r e , T . R . , E r i k s e n , C . W . & D o n c h i n , E . ( 1 9 8 5 ) . A
p s y c h o p h y s i c a l i n v e s t i g a t i o n o f t h e c o n t i n u o u s fl o w m o d e l o f h u m a n i n f o r m a t i o n
processing. Journal of Experimental Psychology: Human Perception and Performance, 11,
529-553.
Eriksen, C.W. and Schultz, D.W. (1977). Retinal locus and acuity in visual
information processing. Bulletin of the Psychonomic Society, 9:81-84.
Eriksen, C. W. and St. James, J. D. (1986). Visual attention within and around the field of focal
attention: a zoom lens model. Perception and Psychophysics, 40, 225-240.
Estes, W.K., Allmeyer, D.H. & Reder, S.M. (1976). Serial position functions for letter
identification at brief and extended exposure durations. Perception and Psychophysics, 19,
1-15.
Feldman, J.A. (1985). Four frames suffice: A provisional model of vision and space. The
Behavioral and Brain Sciences, 8, 265-313.
Feldman, J.A. & Ballard, D.H. (1982). Connectionist models and their properties. Cognitive
Science, 6, 205-254.
Finke, R.A. (1979). The functional equivalence of mental images and errors of movement.
Cognitive Psychology, 11, 235-264.
Finke, R. A. and Pinker, S. (1982). Spontaneous imagery scanning in mental extrapolation.
Journal of Experimental Psychology: Learning, Memory and Cognition, 2, 142-147.
Fodor, J.A. and Pylyshyn, Z.W. (1981). How direct is visual perception: Some reflections on
Gibson’s ‘ecological approach’. Cognition, 9, 139-196.
Goldberg M. and Wurtz, R. (1972). Activity of superior colliculus in behaving monkeys. I:
Visual Receptive Fields in Single Neurons, Journal of Neurophysiology, 35, 542-559.
Goodale, M.A., Pelisson, D. & Prablanc, C. (1986). Large adjustments in visually guided
reaching do not depend on vision of the hand or perception of target displacement. Nature, 320,
748-750.
Hayes, J.R. (1973). On the function of visual imagery in elementary mathematics. In W. Chase
(ed) Visual Information Processing. New York: Academic Press.
Hochberg, J. (1968). In the Mind’s Eye. In R.N. Haber (ed) Contemporary Theory and Research
in Visual Perception. New York: Holt, Rinehart & Winston.
H o f f m a n , J .E. and Nelson, B. (1981). Spatial selectivity in visual search. Percepti o n a n d
Psychophysics, 30, 283-290.
Hoffman, D. and Richards, W. (1985). Parts of Recognition. In A. Pentland (ed), From Pixels
to Predicates. Norwood, N.J.: Ablex Publishing.
Johnson, N.F. (1972). Organization and the concept of a memory code. In A.W. Melton & E.
Martin (Eds), Coding Processes in Human Memory. New York: Winston.
Jolicoeur, P. (1988). Curve tracing operations and the perception of spatial relations. In Z.W.
Pylyshyn (ed). Computational Processes in Human Vision: Interdisciplinary Perspectives.
Norwood, N.J.: Ablex Publishing, in press.
Kahneman, D., Treisman, A., and Gibbs, B. (1983). Moving objects and spatial attention.
Presented at the 20th Annual Meeting of the Psychonomic Society, San Diego,
California.
Klahr, D. (1973). Quantification processes, In W. Chase (ed) Visual Information Processing.
New York: Academic Press.
Koch, C. and Ullman, S. (1984). Selecting one among the many: a simple network
implementing shifts in selective visual attention. A.I. Memo 770. Cambridge, MA: MIT AI
Lab.
Kosslyn, S.M., Ball, T.M., and Reiser, B. J. (1978). Visual Images Preserve Metrical Spatial
Information: Evidence from Studies of Image Scanning. Journal of Experimental
Psychology: Human Perception and Performance, 4:46-60.
Kosslyn, S. M., S. Pinker, G. Smith, and S. P. Shwartz. (1979). On the Demystification of Mental
Imagery, The Behavioral and Brain Sciences, 2:535-548.
Laberge, D. (1983). Spatial extent of attention to letters and words. Journal of Experimental
Psychology: Human Perception and Performance 9, 371-379.
Marr, D. (1982). Vision. San Francisco: W.H. Freeman.
Marr, D., and Nishihara, H.K. (1976). Representation and Recognition of Spatial Organization
of Three-Dimensional Shapes, MIT A.I. Memo 377:1-57.
Mahoney, J.V. and Ullman, S. (1988). Image chunking: defining spatial building blocks for scene
analysis. In Z.W. Pylyshyn (ed), Computational Processes in Human Vision:
Interdisciplinary Perspectives. Norwood, N.J.: Ablex.
Miles, F.A. & Kawano, K. (1987). Visual stabilization of the eyes. Trends in Neurosciences, 10,
153-158.
Mishkin, M., Ungerleider, L.G. and Macko, K.A. (1983). Object vision and spatial vision: two
cortical pathways. Trends in Neuroscience, 6, 414-417.
Eriksen, C.W., and Murphy, T.D. (1987). Movement of attentional focus across the visual field:
A critical look at the evidence. Perception and Psychophysics, 42, 299-305.
Newell, A. (1973). Production Systems: Models of Control Structures, in Visual Information
Processing, ed. W. Chase. New York: Academic Press.
Newell, A. (1980). Harpy, production systems and human cognition. In R. Cole (Ed.),
Perception and Production of Fluent Speech, Hillsdale, N.J.: Erlbaum.
Peacocke, C. (1983). Sense and Content. Oxford: Clarendon Press.
Pentland, A. (1987). Recognition by Parts. Proc. ICCV 87, London, June 1987.
Pentland, A. (1986). Perceptual Organization and the Representation of Natural Form. Artificial
Intelligence Journal, 28, 1-38.
Posner, M.I., Nissen, M.J., and Ogden, W.C. (1978). Attended and unattended processing modes:
The role of set for spatial location. In H.L. Pick, and I.J. Saltzman (eds), Modes of
Perceiving and Processing Information, Hillsdale, New Jersey: Lawrence Erlbaum.
Potter, M. (1975). Meaning and visual search. Science, 187, 965-966.
Pylyshyn, Z.W. (1984). Computation and Cognition: Toward a Foundation for Cognitive
Science. Cambridge, Mass.: MIT Press, a Bradford Book.
Pylyshyn, Z.W. (1981). The Imagery Debate: Analogue Media versus Tacit Knowledge,
Psychological Review 88:16-45.
Pylyshyn, Z.W. (1973). What the Mind’s Eye Tells the Mind’s Brain: A Critique of Mental
Imagery. Psychological Bulletin, 80, 1-24.
Pylyshyn, Z.W., Elcock, E.W., Marmor, M., and Sander, P. (1978a). Explorations in Visual-
Motor Spaces, Proceedings of the Second International Conference of the Canadian Society
for Computational Studies of Intelligence, University of Toronto.
Pylyshyn, Z.W., Elcock, E.W., Marmor, M., and Sander, P. (1978b). A system for perceptual-
motor based reasoning. Technical Report #42. Department of Computer Science, University
of Western Ontario, London, Ontario, Canada.
Pylyshyn, Z.W. and Storm, R.W. (1989). Tracking of Multiple Independent Targets: Evidence
for a Parallel Tracking Mechanism. Spatial Vision, in press.
Remington, R. & Pierce, L. (1984). Moving attention: Evidence for time-invariant shifts of
visual selective attention. Perception and Psychophysics, 35, 393-399.
Rock, I. (1981). Anorthoscopic Perception, Scientific American 244:145-153.
Rock, I., and Ebenholtz, S. (1962). Stroboscopic Movement Based on Change of Phenomenal
rather than retinal location, American Journal of Psychology, 75:193-207.
Rock I., and Gutman, D. (1981). The Effect of Inattention on Form Perception, Journal of
Experimental Psychology: Human Perception and Performance 7:275-285.
Shepard, R.N. (1978). The mental image. American Psychologist, 33, 125-137.
Shepard, R.N., and Podgorny, P. (1978). Cognitive Processes that Resemble Perceptual
Processes, in W.K. Estes (Ed.), Handbook of Learning and Cognitive Processes (Vol. 5),
Hillsdale, N.J.: Erlbaum.
Shulman, G.L., Remington, R.W., and McLean, J.P. (1979). Moving Attention through Visual
Space, Journal of Experimental Psychology: Human Perception and Performance
5:522-526.
Steinbach, M.J. (1988). Muscles as sense organs. Archives of Ophthalmology. xx-xx.
Stevens, J.K. (1978). The corollary discharge: Is it a sense of position or a sense of space? The
Behavioral and Brain Sciences, 1, 163-164.
Treisman, A. and Gelade, G. (1980). A feature integration theory of attention. Cognitive
Psychology, 12, 97-136.
Treisman A. and Kahneman D. (1984). Accumulation of Information within object files.
Presented at the 24th Annual Meeting of the Psychonomic Society. San Diego, California.
Tsal, Y. (1983). Movements of Attention across the Visual Field, Journal of Experimental
Psychology: Human Perception and Performance 9:523-530.
Turvey, M.T. (1977). Contrasting Orientations to the Theory of Visual Information Processing,
Psychological Review 84:67-88.
Ullman, S. (1984). Visual Routines, Cognition 18:97-159.
Wu, J.J. and Caelli, T.M. (in press). On locating objects and recovering their motions: A
predictive method for computational prehension. In M. Goodale (ed), Vision and Action:
The Control of Grasping. Norwood, N.J.: Ablex.
Wright, R.D., Dawson, M.R. and Pylyshyn, Z.W. (1987). Spatio-temporal parameters and the
three-dimensionality of apparent motion: Evidence for two types of processing. Spatial
Vision, 2, 263-272.
Wurtz, R.W., and Mohler, C.W. (1976). Organization of Monkey Superior Colliculus: Enhanced
Visual Response of Superficial Layer Cells, J. Neurophysiol. 39:745-765.
Yantis, S. (1988). On analog movements of visual attention. Perception and Psychophysics, in
press.
Table 1a: Summary of Some Assumptions of the
FINST Model
1. Primitive retinotopic processes produce feature-clusters automatically and in parallel across the retina.
2. Certain of these clusters are selected or activated (also in parallel) based on their distinctiveness within a local neighbourhood (e.g. the so-called “popout” or odd-man-out features). These tend to be feature clusters that are reliably associated with distinct distal or scene features.
3. The activated clusters compete for a finite pool of internal referencing tokens called FINSTs. This also happens in parallel, and the initial assignment of FINSTs is stimulus-driven. Since the supply of FINSTs is limited, this is a resource-constrained process.
4. The primitive processes that create feature clusters also maintain their integrity: A FINST that is bound to a feature cluster keeps being bound to it as the cluster changes its location continuously on the retina. In this way FINSTs “point to” fixed places in a scene without identifying what is being pointed to – serving like the indexical pronouns “here” or “there”.
5. Some higher order patterns, consisting of aggregates of primitive feature clusters (e.g., contours), can also be assigned FINSTs under either top-down or bottom-up control.
6. Only FINSTed feature clusters can enter into subsequent processing: i.e., relational properties like INSIDE(x,y), PART-OF(x,y), ABOVE(x,y), COLLINEAR(x,y,z), … can only be encoded if features x, y, z, … are FINSTed.
7. There can be some top-down influence in the selection of which activated clusters receive FINSTs. Higher level processes can, for example, direct a FINST to be placed on certain already activated features, defined in terms of other FINSTed features (e.g. INTERSECTION(u:line, v:line), where the two lines are already FINSTed).
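Assumptions 1 through 3 can be caricatured as a stimulus-driven competition for a finite pool of indexes. The distinctiveness scores, the scoring rule, and the pool size below are placeholders for illustration, not claims of the model:

```python
N_FINSTS = 4  # an assumed size for the finite pool of indexes

clusters = [  # (cluster id, distinctiveness within its local neighbourhood)
    ("c1", 0.9), ("c2", 0.2), ("c3", 0.7), ("c4", 0.4),
    ("c5", 0.8), ("c6", 0.1), ("c7", 0.6),
]

# Stimulus-driven competition: the most distinctive ("popout") clusters
# win the available FINSTs; the rest remain unindexed.
winners = sorted(clusters, key=lambda c: c[1], reverse=True)[:N_FINSTS]
finsts = {i: cid for i, (cid, _) in enumerate(winners)}

print(finsts)  # {0: 'c1', 1: 'c5', 2: 'c3', 3: 'c7'}
```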
Table 1b: Further assumptions involving the motor system
8. Corresponding to the FINST index in vision, which allows objects to be referred to in visual predicates, there is an index called an ANCHOR that allows objects to be referred to in motor commands. Only objects bound to ANCHORs can appear as arguments to the MOVE(x,y) command (currently the only motor command assumed).
9. Only two moveable objects are assumed in the minimal version of the model: the FOVEA and the POINTER. These are assumed to always be ANCHORed, so that the pointer can be moved into the FOVEA and the FOVEA can be moved to the location of the pointer. Hence MOVE(FOVEA, POINTER) and MOVE(POINTER, FOVEA) are assumed to be primitive operations.
10. There is a primitive operation, called BIND, for cross-binding an element bound to a FINST to one bound to an ANCHOR, thus allowing a cross-reference between indexes in different modalities, the first step towards visual-motor coordination. This allows a MOVE command to be issued to the location of a feature that was once on the retina, even after the feature is no longer visible. The system can command either of the moveable objects to move to the location of the ANCHOR by using the primitive operation BIND(x:FINST, y:ANCHOR), followed by either MOVE(FOVEA, y:ANCHOR) or MOVE(POINTER, y:ANCHOR).
11. Objects that are bound to an ANCHOR (which always includes the POINTER and the FOVEA) can serve in place of FINSTed features when evaluating perceptual predicates (such as ABOVE(x,y), INSIDE(x,y), and so on) even if they are not on the retina. In other words these two objects provide a limited means for evaluating spatial relations among pairs of places when both places are not visible concurrently.
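A minimal sketch of how BIND and MOVE might interact under assumptions 8 through 10. Representing places as coordinates, and all function and variable names here, are assumptions of this illustration rather than details of the model:

```python
locations = {"FOVEA": (0, 0), "POINTER": (5, 5)}    # the two moveable objects
anchors = {"FOVEA": "FOVEA", "POINTER": "POINTER"}  # these are always ANCHORed
finsts = {"f1": (9, 3)}                             # a currently indexed feature

def bind(finst_id, anchor_name):
    """BIND(x:FINST, y:ANCHOR): cross-bind an indexed visual feature to an
    anchor so motor commands can target its place later."""
    anchors[anchor_name] = finst_id
    locations[anchor_name] = finsts[finst_id]

def move(obj, anchor_name):
    """MOVE(x, y:ANCHOR): bring a moveable object to an anchored place."""
    locations[obj] = locations[anchor_name]

bind("f1", "A1")             # cross-bind the feature to anchor A1
del finsts["f1"]             # the feature leaves the retina...
move("POINTER", "A1")        # ...yet the pointer can still be moved to it
print(locations["POINTER"])  # (9, 3)
```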