A Model of Spatial Indexing
The Role of Location Indexes in Spatial Perception: A Sketch of the FINST Spatial-index Model
Zenon Pylyshyn
Center for Cognitive Science, University of Western Ontario
Introduction
Marr (1982) may have been one of the first vision researchers to insist that in modeling
vision it is important to separate the location of visual features from their type. He argued that in
early stages of visual processing there must be “place tokens” that enable subsequent stages of
the visual system to treat locations independent of what specific feature type was at that location.
Thus, in certain respects a collinear array of diverse features could still be perceived as a line,
and under certain conditions could function as such in perceptual phenomena like the Poggendorff
illusion.
The idea that locations and feature-types are encoded independently is not a new one. A
closely related distinction was widely acknowledged in the literature on list-learning and letter-
recognition, where it has long been known that item information could be encoded or retained
independent of order information (e.g., Estes, Allmeyer & Reder, 1976; Coles, Gratton, Bashore,
Eriksen & Donchin, 1985). The view also has considerable support from neurophysiology,
where evidence has been accumulating for two visual pathways, one specialized for location and
the other for identification (Mishkin, Ungerleider & Macko, 1983). Recently, this idea has
received additional support from the finding that “conjunction illusions” occur under certain
conditions (Treisman & Gelade, 1980). Conjunction illusions are visual illusions in which one
property of a feature (e.g., colour) is mistakenly conjoined with another property (e.g., shape)
which is also present in the stimulus, though at another location. For example, a display
consisting of a green X and a red O may be reported as consisting of a red X and a green O. If
the locations and types of such features were inextricably linked in the internal perceptual
encoding (as they are in any pictorial representation), such illusions would not occur.
This sort of exchange of conjuncts has led to the postulation of independent “feature maps”
with only weak cross-bindings. Another, closely related way to view this phenomenon is in
terms of the dissociation of feature-type and feature-location. Then the cross-talk results from
the failure to bind two properties of the same feature. Feldman and Ballard (1982) view focused
attention as the mechanism typically used to fix the location of a visual object and thereby allow
independent feature-properties associated with that object to be bound together. The importance
of focused attention for feature conjoining is also stressed by Treisman & Gelade (1980). Two
empirical observations implicate the importance of directed, focused attention in the encoding of
feature location: (1) conjunction illusions only occur when attention is shared with another task,
and (2) time to search for conjunctions of features is linear in the number of objects in the search
set, whereas search for a single feature is independent of the number of objects in the display (i.e.
visually primitive features exhibit “pop-out”).
There is, however, an unsatisfactory aspect of the view that in order to encode the location
of a feature one must focus attention on that feature. Visual attention, according to the widely held view,1 is unitary: it can only be directed to one place, or at least to only one local region at a
time and must be scanned from place to place in order to examine several places. Yet we clearly
can analyze patterns distributed over many places. In fact, in Marr’s example, the detection of a
pattern like collinearity of features requires that in some sense the location of all the features in
the set be available at one time, so that the collinearity of those features, rather than some others,
could be ascertained. In other words, in evaluating the predicate COLLINEAR(x1, x2, …, xn) the
arguments x1, x2, …, xn must in some way be bound to specific tokens of the relevant visual
features so that the evaluation takes place with respect to those very features in the scene.
Of course, there are many ways in which the arguments of such a visual predicate might
refer to (or be bound to) the locations of features. The locations might, for example, be explicitly
encoded as Cartesian coordinates and the codes for these coordinates associated with the
arguments of the predicate, in which case the evaluation might proceed by checking that this
———————
1. There has been some discussion in the literature concerning how broad a region can be covered within focal attention and whether the scope of this focal region can be varied (e.g., Eriksen & St. James, 1986). Nonetheless, there is general agreement that there is only one region of focal attention, as opposed to several independent and noncontiguous regions (see, also, the discussions in Hoffman & Nelson, 1981; Laberge, 1983).
array of coordinates forms a linear sequence. But this proposal seems implausible for a number
of reasons: (a) encoding of feature locations in terms of their (x,y) retinal coordinates would
presumably require that the relevant objects be first scanned sequentially and stored (on the
assumption that encoding such properties as coordinates, like the encoding of conjunctions of
features, requires focused attention), (b) the (x,y) coordinate seems too precise an encoding,
especially where larger features are involved, and (c) if the retinal coordinates were the basis for
location encoding it would be difficult to use them to detect patterns among moving elements. In
what follows, an alternative mechanism will be proposed for giving the cognitive system access
to places in the visual field at which some visual features are located, without assuming an
explicit encoding of the locations within some coordinate system or an encoding of the feature
type.
One of the assumptions of the present approach is that there is a pre-attentive or pre-
cognitive mechanism in the visual system for individuating features (or making particular feature
tokens conceptually distinct from the others), and for indexing their locations within the visual
field. The terms pre-attentive or pre-cognitive are used here in order to emphasize that the
hypothesized indexing process is an extremely primitive one that precedes such operations as the
recognition of patterns or the encoding of the relative locations of visual features. The basic idea
is that there is something the visual system must do before it can even begin to discern a spatial
pattern or spatial relations among component features in a display: it must “pick out” or as we
prefer to put it, “individuate” the features among which it will recognize some spatial relations,
such as the relations “above”, “part of”, “inside” and so on. Before you can determine that “this”
is inside “that” you must have a way to, in effect, “point to” the two features to which the
“inside” relation will apply.
In order to accomplish this “pointing” there is no need to have first recognized what feature-
types are being pointed to. All one needs is a way to pick out or index the locations of the
feature-tokens in question. The simple idea that individuation precedes explicit encoding leads
to one of the basic postulates of the present work (the notion of FINSTs), which in turn provides
some of the tools needed to illuminate a number of other puzzles, including some phenomena
involving visual imagery. It also provides the basis for some very preliminary steps towards a
computational theory of perceptual-motor coordination.
Indexing the location of visual features using FINSTs
To help characterize the idea behind the FINST mechanism in a concrete way, imagine the
following. Suppose you place each of your fingers on a different object (or feature-token) in a
scene. Now imagine that the objects are moving about or that you are changing your position
while your fingers keep in contact with the objects. Even if you do not know anything at all
about what is located at the places that your fingers are touching, you are still in a position to
determine such things as whether the object that finger number 1 is touching is to the left of or
above the object that finger number 2 is touching, or whether the object that finger number 3 is
touching is larger than the object that finger number 4 is touching. Of course, you may not be
able to determine this directly without further analysis (e.g. by haptic exploration), but your
finger-contact gives you a way to, in effect, refer to the objects so that some further processing of
them can be undertaken. You do not first have to search for an object that meets some particular
description, because you have a direct way to locate or index relevant scene-tokens for further
processing. Moreover, the access that the finger-contact gives makes it inherently possible to
track a particular token – i.e., to keep referring to what is, in virtue of its historical trace, the
same object, independent of its location in space. In this way you can individuate an object, keep
it conceptually distinct from other objects, and continue to do so as it moves about.
Such a direct mechanical indexing makes it possible to do something that cannot be done
directly in vision. Touch provides a way of indicating to oneself, and therefore of thinking,
“this” object or “that” object, for any object being touched, independent of its location in space.
The parallel case in vision appears to be different, since it would seem that the equivalent of
having direct causal contact with a feature in a 3-D scene is not possible. That’s because the
only visual sensors we have are ones that respond to the 2-D retinal projection of the scene (cf
Pylyshyn, 1984; Fodor & Pylyshyn, 1981), and the mapping from the 3-D object-features to the
2-D retinal features is not in general reversible. Nonetheless – and this is a central assumption
of the FINST model – we can do something very analogous to pointing. The present approach
posits a mechanism called a FINST, which allows one to accomplish, in a limited way,
something that is functionally similar to indexing a feature in a 3-D scene, much as a finger
allowed us to index such a feature in the tactile example discussed above (which is why this type
of index was originally called an “INSTantiation FINger”, abbreviated as “FINST”).
A FINST is, in fact, a reference (or index) to a particular feature or feature-cluster on the
retina. However, a FINST has the following additional important property: Because of the way
clusters are primitively computed, a FINST keeps pointing to the “same” feature cluster as the
cluster moves across the retina.2 If the retinal feature cluster identified in this way maintains a
reliable correlation (over time) with some particular feature of the distal scene, then the FINST
will succeed in pointing to that distal feature, independent of its location on the retina. Thus
distal features which are currently projected onto the retina can be indexed through the FINST
mechanism in a way that is transparent to their retinal location. In addition, FINSTs allow the
system to relate particular retinal features to parts of the symbolic representation of the scene
being constructed, by associating the FINST that is bound to a particular feature-token with the
corresponding part of the internal representation of the scene.
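The identity-maintenance policy described in footnote 2 — assign the same symbolic name to lists of contributing points from successive frames when a significant subset of the coordinates remains the same — can be sketched as follows. This is a minimal illustration, not an implementation of the model; the function name, the overlap threshold, and the naming scheme are all invented for the example.

```python
# Minimal sketch of the "list of contributing points" policy: a cluster in
# the new frame keeps the same symbolic name (FINST) as a cluster in the
# previous frame when a significant subset of its points is shared.

def track_finsts(prev, current, overlap=0.5):
    """prev: dict mapping FINST name -> set of (x, y) contributing points.
    current: list of point-sets aggregated from the new frame.
    Returns the new dict of FINST name -> point-set."""
    assigned = {}
    unused = dict(prev)
    next_id = len(prev)
    for cluster in current:
        # Find the old cluster sharing the largest fraction of points.
        best_name, best_frac = None, 0.0
        for name, points in unused.items():
            frac = len(cluster & points) / max(len(cluster), 1)
            if frac > best_frac:
                best_name, best_frac = name, frac
        if best_name is not None and best_frac >= overlap:
            assigned[best_name] = cluster        # same index sticks to it
            del unused[best_name]
        else:
            assigned["F%d" % next_id] = cluster  # new feature, new index
            next_id += 1
    return assigned
```

On this policy the index follows the moving cluster with no explicit encoding of where the cluster is: identity is carried entirely by the overlap of contributing points from frame to frame.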
Notice that a FINST is very different from an encoding of the position of a feature. The
FINST itself does not encode any properties of the feature in question; it merely makes it
possible to locate the feature in order to examine it further if needed. Like the “chunks” in short-
term memory (see Johnson, 1972; Newell, 1973) FINSTs are opaque to the properties of the
objects to which they refer. For example, although one FINST may have been activated by a
colour and another by a shape appearing in the visual field, both could be properties of the same
object. Yet one could not tell by examining the two FINSTs that they referred to two properties
located at the very same place (i.e., that they were distinct properties of the same visual element).
Conversely, one could not tell by examining two FINSTs that they referred to two different
places that had the same property. Unlike coordinates or other explicit codes, the only way to
tell whether two FINSTs refer to the same location or property is by accessing the features that
are indexed by the FINSTs in question and determining whether the locations or properties (and
not the FINSTs themselves) were identical. Thus there is a fundamental difference between
FINSTs and various possible encodings associated with features (e.g. encodings of their
properties or their locations in some coordinate system).
One of the main purposes of FINSTs is to allow higher cognitive processes to refer to
specific visual features in evaluating certain spatial-relation predicates that apply to these
features (some additional uses for FINSTs, which will be discussed later, include allowing a link to be established between high-level descriptions of a scene and particular places3 in that scene,
———————
2. One might imagine, for example, a process operating in parallel over the retinal array and aggregating points that correspond to a putative edge or other scene feature. There are a number of simple possible mechanisms that can be used to make a FINST a “sticky” reference – i.e., to ensure that it keeps being attached to some particular feature-cluster independent of the retinal location of that feature. For example, the traditional way of representing an aggregated set of points in a computer vision system is by maintaining a “list of contributing points” for each cluster or aggregate. Whatever the method of aggregation, the “sticky reference” property of FINSTs could be accomplished by simply following the policy of assigning the same symbolic name to lists of contributing points from successive frames of view, if a significant subset of the coordinates on the list remains the same. We are currently also experimenting with several different, and perhaps more psychologically plausible, implementations of this identity-maintenance property of FINSTs. One is a network implementation, similar to the one that Koch & Ullman (1984) developed for modelling selective attention. The other is in the spirit of “token matching” approaches, using a predictive filter technique similar to that developed by Wu and Caelli (in press) for object tracking. These are the only portions of the FINST model that we have attempted to implement so far, which deal with real digitized images (although see footnote 9).
3. We sometimes speak of FINSTs as indexing places in a scene, in order to emphasize that it is feature-location rather than feature-type that is being indexed. However, it should be kept in mind that the theory only provides for filled places to be indexed in this way, not places in a totally empty region of the visual field.
and allowing motor commands to refer to the locations of these features in order to direct limbs
or eye movements to them). Being able to index particular features is particularly important
when encoding relational properties involving several places. For example, the assumption is
made that in order for the cognitive system to encode a relational property holding among several
places – such as COLLINEAR(x,y,z) or INSIDE(u,v) or PARALLEL(m:line, n:line) – the
arguments to these predicates must first be bound to features or places in the scene, i.e. FINSTs
must be assigned to the locations of the relevant features. Once assigned, groups of features or
“chunks” may also be formed, and under certain conditions a FINST may be assigned to the
entire chunk. A chunk which has a FINST bound to it may or may not also have FINSTs bound
to its component parts. However, in order to evaluate an n-place predicate, such as PART-
OF(x:element, c:chunk), all its arguments have to be bound to FINSTs.
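The binding requirement can be sketched in a toy formulation (the model itself does not specify how predicates access locations; `collinear`, `bind`, and the closure-based indexes below are illustrative stand-ins, not part of the theory): a predicate such as COLLINEAR is evaluated only over arguments bound to indexes, and the locations it consults are read through the indexes at evaluation time, never stored in them.

```python
# Toy sketch: a spatial-relation predicate evaluated over index-bound
# arguments. Locations are dereferenced at evaluation time.

def collinear(*indexes):
    # Model precondition: every argument is bound to an index.
    pts = [deref() for deref in indexes]
    if len(pts) < 3:
        return True
    (x0, y0), (x1, y1) = pts[0], pts[1]
    # Cross-product test against the line through the first two points.
    return all((x1 - x0) * (y - y0) == (y1 - y0) * (x - x0)
               for (x, y) in pts[2:])

def bind(scene, key):
    """Return a FINST-like index: dereferencing it consults the scene."""
    return lambda: scene[key]

scene = {"a": (0, 0), "b": (1, 1), "c": (2, 2)}
fa, fb, fc = bind(scene, "a"), bind(scene, "b"), bind(scene, "c")
assert collinear(fa, fb, fc)
scene["c"] = (2, 5)               # the indexed feature moves...
assert not collinear(fa, fb, fc)  # ...and the same bindings track it
```

Because the predicate consults the scene through its bound indexes rather than through stored coordinates, the evaluation remains correct for moving elements — the difficulty raised earlier for coordinate-based encodings.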
The question of how FINSTs are assigned in the first instance remains open, although it
seems reasonable that they are assigned primarily in a stimulus-driven manner, perhaps by the
activation of locally-distinct properties of the stimulus – particularly by new features entering
the visual field. Indeed, there is evidence (Burkell & Pylyshyn, 1988) that some transients, such
as luminance changes, and not others, such as isoluminant colour changes, do attract FINSTs. In
addition, under certain conditions top-down processes may also play a role in specifying which
of the potential active features get assigned a FINST.
Because of their pivotal role in enabling relational encoding to take place, FINSTs occupy a
critical place in visual processing. This makes it particularly tempting to view them as
representing a resource-constraint bottleneck, similar in spirit to the hypothesized limit on the
number of chunks that may be held in short-term memory, or even closer in spirit to Newell’s
(1980) assumption of a cost associated with each variable that gets bound in the matching of
conditions in a production system. This assumption is indeed part of the provisional picture of
the FINST mechanism. Notice that with this assumption, if both a chunk and its parts are
indexed (as would be required in order to determine whether a certain feature is PART-OF
another) this requires more resources than if the chunk alone (or the parts alone) are indexed, as
seems reasonable on intuitive grounds.
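The resource-constraint assumption might be caricatured as a small fixed pool of indexes. The sketch below is an illustration only: the class name is invented and the pool size is arbitrary, chosen merely to show that indexing a chunk together with all its parts can exhaust the pool where indexing the chunk alone does not.

```python
# Caricature of the resource limit: a fixed pool of indexes; an assignment
# fails once the pool is exhausted.

class FinstPool:
    def __init__(self, size=4):          # pool size is illustrative only
        self.size = size
        self.bound = {}                  # label -> indexed target

    def assign(self, label, target):
        """Bind an index to a target; return False if no index is free."""
        if label in self.bound:
            return True
        if len(self.bound) >= self.size:
            return False                 # no free index: binding fails
        self.bound[label] = target
        return True

    def release(self, label):
        self.bound.pop(label, None)

pool = FinstPool(size=4)
# Indexing a chunk alone costs one index...
assert pool.assign("chunk", ["p1", "p2", "p3", "p4"])
# ...but indexing the chunk *and* all four parts exceeds the pool.
results = [pool.assign(p, p) for p in ["p1", "p2", "p3", "p4"]]
assert results == [True, True, True, False]
```

The failure on the fourth part is the intended analogue of the intuition in the text: evaluating PART-OF over a chunk and its parts demands more index resources than either alone.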
As will be apparent, many of the assumptions surrounding the FINST idea are highly
provisional. Many questions concerning the properties of the FINST mechanism await further
empirical exploration. Nonetheless, there are already numerous consequences (to be discussed
below) of those of the present assumptions that seem most secure. The assumptions discussed
above are summarized in Table 1a for reference.
==============================
Insert Table 1 about here
==============================
Before discussing some empirical studies dealing with the role of FINSTs in visual attention,
another general argument for the need for such indexes will be sketched. This argument centres
around the problem of how we attain a stable representation of perceptual space. This, in turn,
leads to some additional assumptions concerning the role of FINSTs and other (non-visual)
indexes in motor control.
Constructing a spatially stable representation
Although the input to the visual system consists of continually moving images on the retina,
we nonetheless perceive a world that remains stable with respect to a global frame of reference.
This suggests that while there is an early stage at which the visual system operates upon a
retinotopic representation, there must be a later stage at which locations of perceived scene
features are encoded in relation to a frame of reference that is fixed in space (or at least a 2-D
projection of such a coordinate system).
It has been fairly traditional to assume that the geostability of visual perception means that
there is a representation which consists of a global image of the scene, fixed in distal (or world)
coordinates. The usual idea is that people construct and update such a geostable image of a
s c e n e b y “ p a i n t i n g ” t h e r e t i n o t o p i c r e p r e s e n t a t i o n o n t o a n e x t e n d e d 2 - D i m a g e o f t h e
environment, and that this image depicts a scene which is fixed within a geostable frame of
reference.4 In such models, the effect of eye movements is typically neutralized by locating the
point at which the retinal information is transferred to the extended image so that it is in exact
correspondence with the direction of gaze. One version of this view, called the “corollary
discharge” theory, claims that an “efference copy” of the signal going to the eye muscles is also
sent to the mechanism that superimposes the retinal image on the extended internal geostable
image. The fact that we can integrate information from different glances has typically been
taken as strong support for the view that there is a global stable image at some stage in visual
———————
4. For example, Feldman’s (1985) model of spatial perception posits a stable “feature frame” representation, onto which the retinotopic representation is mapped. Although Feldman’s feature frame is a global representation, it differs considerably from the simple “global-image” views: it is an active parameter-space representation, not a matrix corresponding to the 2D projection of the world into which the retinotopic information is deposited. Feldman uses “value units” to induce a mapping between the two frames (much as is done with the Hough transform mapping from images to parameter spaces; see Ballard, 1986). This differs from the present approach, which does not map the entire retinotopic representation onto some global space at all, but only provides indexes to selected FINSTed features, and cross-bindings to a descriptive symbolic representation.
processing. Such a view is widely accepted, even though the details of the registration process
are far from settled. (The facts concerning the relation between visual stability, gaze, and
various sources of information (such as motor efferents) appear to be open to question (see, for
example, Stevens, 1978; Steinbach, 1988; Miles & Kawano, 1987). Even the critical role of eye
movements is questionable since Hochberg (1968) has shown that under certain conditions
“glances” presented passively over the same retinal position can be perceptually integrated.)
One might ask whether the assumption of a global geostable image is necessary in order to
account for the phenomena of geostability, or whether the relevant phenomena might be
compatible with a simpler mechanism. To answer this question we must first be clear about the
empirical considerations that need to be addressed by such a mechanism. The basic one,
mentioned earlier, is that the world does not appear to move as our eyes move. Although
reliance on such phenomenology is generally considered problematic (for example, there is
evidence that people can respond to movements in the perceptual world of which they are not
consciously aware – see, e.g., Goodale, Pelisson & Prablanc, 1986), there is no doubt that the
phenomenal experience of stability is an important reason for postulating a geostable
representation.
Another relevant consideration is the observation that certain perceptual phenomena appear
to depend on scene coordinates rather than retinal coordinates. For example, there is evidence
that apparent motion is sensitive to scene coordinates (Rock and Ebenholtz, 1962), and that the
“correspondence problem” (Ullman, 1979) may be solved in scene coordinates as well; although
in some of these cases it is an open question whether these processes operate over a 2-D or over
a 3-D representation (see Wright, Dawson & Pylyshyn, 1988). This too has suggested to people
that there is a stage in visual perception where the information is encoded as a global geostable
image.
Finally, another straightforward, and from our perspective even more important, aspect of
geostability is the fact that perceived space connects with the motor system in a stable and
globally consistent manner. If we point to some object we perceive, the direction we point is
independent of where the projection of that object falls on our retina: it depends on where the
object is in scene coordinates.
What capacities must a system possess in order to be able to exhibit these phenomena, which
are characteristic of geostability? In order for a system to exhibit spatial stability, the following,
at least, should be true. First, under certain conditions of movement of features on the retina, the
system must be able to identify the sequence of features that correspond to the same place in the
scene. Second, the system must have some way to refer to the location of features that are not on
the retina (i.e. recalled features) in order to detect patterns that extend beyond the range of the
retina itself. Third, the system must have some way to coordinate movements (whether eye
movements or pointing) with the locations of both retinal and recalled (non-retinal) features. The
first two requirements recognize the need for some sort of coordination between retinal features
and off-retina features which allows one to identify sequences of proximal features as arising
from the same distal feature, even if the sequence is discontinuous and interrupted (i.e. as the
proximal feature moves off the retina and back again in the course of eye movements). The third
requirement recognizes that part of geostability concerns the cross-binding of locations in
perceptual and motor reference frames.
Although how these three requirements are met by the nervous system is far from clear,
there is at least reason to doubt that the task requires the piecewise “painting” of an extended
internal image. Indeed, the task does not even appear to require that locations of features be
explicitly encoded (say, in terms of their Cartesian coordinates), only that some means be
available for indexing the features so that they can be addressed by primitive perceptual and
motor operations. Consider how the FINST mechanism might provide a way to meet the first
requirement.5 To get an idea of how this might work, recall the “pointing fingers” analogy
discussed in the previous section. The use of mechanically-linked tactile sensors obviated the
need for an explicit encoding of global locations. One did not need such a global image in that
case because all the information needed for evaluating spatial-relation predicates remained in the
scene and could be accessed as required by using the tactile-links as indexes.
By hypothesis, FINSTs provide a precisely analogous way of indexing a number of feature-
places in a scene independent of their retinal locations. This, in turn, provides the basis for
achieving some of the effects that can be derived from an extended geostable image. For
example, if FINSTs provide the reference points for determining the relative perceived locations
of features, then the fact that FINSTs remain bound to distal features as the eyes move about6means that their relative perceived locations will remain invariant with eye movements. . This is
exactly what happened in the case of the tactile example discussed earlier, where fingers were
used as indexes to distal features. The approximate transparency of reference to scene features
———————
5. The following discussion should not be read as suggesting that the ability to index features using the FINST mechanism explains how humans achieve visual stability. Indeed, it seems quite likely that in the human visual system a variety of mechanisms take part in achieving this sort of stability – including monitoring both efferent and afferent signals from several sources, as well as monitoring a variety of dynamic visual patterns, such as optic flow. The point of this discussion is simply to suggest that FINSTs may be sufficient for the task, and therefore to argue that an extended internal image is not entailed by the facts of visual stability or the stability of visual-motor orientation.
6. Without the benefit of perceptually distinct features in a visual scene, to which perception can anchor stable referents, it is very difficult even to achieve visual stability. Thus vision in the Ganzfeld (or structureless visual field) is unstable, and people lose the sense of where their eyes are pointing or where they had previously been pointing. Indeed, motion and form perception are both seriously affected after 90 seconds of Ganzfeld exposure (Avant, 1965).
that the FINST mechanism makes possible, means that as long as the relative locations of
indexed objects remain fixed in the scene, their perceived relative spatial locations will not
change even though their retinal locations are changing.
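This invariance can be made concrete with a small sketch. It deliberately simplifies retinal position to scene position minus gaze direction, and the names are invented for the example; the point is only that a relation computed through indexes that stay bound to distal features does not change when the eyes, and hence all retinal coordinates, move.

```python
# Sketch: relative-location judgments made through scene-bound indexes are
# invariant under eye movements, even though retinal coordinates are not.

scene = {"star": (10, 5), "moon": (4, 5)}   # distal features, fixed in scene
gaze = [0, 0]                               # current gaze direction

def retinal(name):
    # Simplification: retinal position = scene position relative to gaze.
    x, y = scene[name]
    return (x - gaze[0], y - gaze[1])

def left_of(index_a, index_b):
    # Evaluated through the indexes: both features are re-located on the
    # current retina, so the *relative* judgment is unaffected by gaze.
    return retinal(index_a)[0] < retinal(index_b)[0]

before = left_of("moon", "star")
gaze[0], gaze[1] = 7, 3                     # the eyes move...
after = left_of("moon", "star")
assert before and after                     # ..."moon left of star" holds,
assert retinal("star") != (10, 5)           # though retinal positions changed
```

As in the tactile analogy, nothing here encodes a global geostable image: stability of the judgment falls out of the stability of the bindings.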
The second requirement on a system that can exhibit geostability properties, mentioned
above, was that it could integrate retinal information with information that is no longer on the
retina (but which might have been part of a previous glance). The “extended internal image” idea
is intended, in part, to allow the perceptual integration of these two types of information by
providing a representation that contained both types of features which then could be examined by
some subsequent mechanism (the “mind’s eye”). To see whether such a representation is needed
for this purpose, we need to consider the nature of the stored off-retinal information and the type
of integration that is possible.
Only a little is known concerning this question. For example, it is known that off-retinal
information is encoded in a sufficiently partial or abstract form that it does not enter into
perceptual processes in precisely the same way as retinal information. Off-retinal information
does not combine with retinal information to produce certain perceptual phenomena that occur
when all the information is retinal. For example, impossible figures (such as the “devil’s
pitchfork”) are not easily detected if the distances between inconsistent portions are large, or if the
information is presented in “glances” in some arbitrary order. Similarly, the automatic
interpretation of certain line drawings as depicting three dimensional objects does not occur as
readily if the parts of the figure are far apart or are not presented in an appropriate order or if the
segmentation of the scene into glances fails to present entire critical features in individual
glances (Hochberg, 1968). Such results suggest that the off-retinal information is not ‘visual’ in
the same way that retinal information is, but rather is abstract and conceptual – much like the
information in a mental image (Pylyshyn, 1981). In general it is easy to overestimate the amount
of visually reinterpretable information available at off-retinal locations. Indeed, if one could
index off-retinal locations in some way (as will be hypothesized below) then a simple label
attached to such an index (e.g. “concave edge,” “convex edge,” “outside boundary,” etc.) might
provide all the information needed to account for such things as the anorthoscope or eye-of-the-needle effect7 (see Rock, 1981). This conclusion is suggested by Hochberg’s (1968) finding that
perception of forms presented in a sequence of preprogrammed “glances” only occurs if the
order of the glances is one that enables the identity of individual contours to be tracked (e.g. in
———————
7. In an unpublished study, Ian Howard showed that when an image is moved behind a slit at medium speeds in the anorthoscope (i.e. at speeds slow enough to avoid self-masking), the ability to recognize the pattern depends on the memory requirements of the task. If the image seen through the slit has many contours that must be followed, the task is more difficult than if there are few contours, even though the image may be of exactly the same geometric form in the two cases, except for orientation (e.g. one might consist of a form such as “E” while the other consists of the same form rotated by 90 degrees: the former requires that three contours be tracked as the figure moves horizontally behind a vertical slit, while the latter requires only one). At sufficiently slow speeds the advantage of the fewer-contour version disappears.
presenting a rectangle one would have to present the sides in an order which preserves their
connectivity – in either a clockwise or counterclockwise cycle).
Thus it appears that the facts of perceptual integration may not require anything as
extravagant as a global image. They do, however, require something more than has been
assumed in the FINST hypothesis so far: a mechanism for evaluating relational
predicates involving both retinal and nonretinal places. In addition, we need a mechanism to
deal with the third requirement listed above; a way to relate the locations of perceived features to
motor commands. We shall return to both these issues in the next section.
Indexing for motor commands: Binding features to ANCHORs
In order to extend the usefulness of the FINST mechanism beyond the case where all
indexed information remains on the retina, it is necessary to address such additional questions
as: (1) How does the system maintain the identity of a feature cluster when the cluster disappears
off the retina and later reappears? and (2) How, in general, does the system compute the spatial
relation among features that are not on the retina concurrently? These are difficult problems
because it is clear that their solution depends on proprioceptive as well as visual information, and
also because they involve memory. While it is not known how the human visual system
manages to achieve the skills referred to above, the present approach has been to ask first for
sufficient conditions for it to be possible.
Clearly we can represent proprioceptive information and we can issue motor commands that
result in our eyes or limbs moving to desired locations. Let us put aside, for the moment, the
question of how this is done. Let us assume that the ability to issue a certain limited set of motor
commands, which cause a limb or eye to move to selected sensed locations, is part of our
primitive perceptual-motor capacity. Whatever the mechanisms by which these things are
accomplished, they are sure to be quite different from those with which we are currently familiar
– such as those being used in the design of industrial robot arms.
The strategy adopted in developing the present highly provisional and speculative ideas
concerning some problems of perceptual-motor coordination might be called a minimal-mechanism strategy.[8] In understanding how a cognitive or perceptual-motor function could be
———————
8. The term “minimal” is used here in an informal sense to suggest that the mechanisms appear to embody the smallest set of assumptions necessary for accomplishing the task – though there is no proof that no “simpler” mechanism is possible, and indeed the very notion of simplicity used here is not made explicit. The mechanisms are minimal in the sense that a Turing Machine is a minimal mechanism for computing: it is very elementary, yet sufficient for the task.
accomplished (how it is possible), one approach is to attempt first to discern the nature of what
has been called the ‘task demands’. There are a number of ways to approach this goal. One
way, championed by Marr (1982), is to attempt to develop a ‘theory of the computation’: an
abstract theory of the input-output function computed by the system which relates the function to
a goal (what, in the natural life of the organism, the function is for) and specifies some conditions
under which the goal can, in principle, be satisfied. This strategy has been extremely successful
in guiding research towards the discovery of a variety of ‘natural constraints’ among visual
properties.
There are, however, other heuristic strategies for approaching the difficult problem of
understanding the nature of ‘task demands’. An alternative strategy, which is the one adopted
here, is to take a small set of simple capacities that people appear to possess and see whether the
assumption that these are primitive operations in the human organism allows one to develop a
model that is sufficient for the task at hand, and which also accounts for certain otherwise
puzzling phenomena. This sort of minimalist top-down strategy has occasionally been used to
advantage in designing computational models. Good examples are Newell’s (1973) production
system architecture, and Marr and Nishihara’s (1976) SPASAR mechanism for rotating 3-D
models into a canonical orientation in the process of recognition. Inasmuch as it is also an
attempt to work out a set of basic operations which can be used to create a procedure for carrying
out the task, it is very similar in spirit to Ullman’s (1984) hypothesis of a set of basic operations
which form ‘visual routines’ for detecting spatial relationships in visual stimuli.
The simple operations that were initially assumed to be primitive are those that assign
FINSTs (which have already been discussed), together with a ‘MOVE’ operation which is
capable of causing certain objects to move to specified locations. As in the basic idea behind the
FINST hypothesis (wherein only FINSTed objects can serve as arguments to visual predicates),
it is assumed that only places that are indexed in the appropriate way can serve as arguments to
the MOVE command. The index that fills the role for the motor system, corresponding to the
FINST index in the visual system, is called an ANCHOR. Thus an ANCHOR is like a FINST,
except it indexes a place in motor-command space (and perhaps in proprioceptive space). One
might think of it as a reference to a place whose position can be accessed by the motor system in
just the way that FINSTed places (see footnote 3) can be accessed by the visual system.
In the simplest version of this speculative model, the only movable objects that have been
postulated are the centre of the visual field (think of this as a pair of cross-hairs) called the
‘FOVEA’, and another object (think of this as the end of a limb) called the ‘POINTER’. The
reason for beginning with such a simple and restrictive set of objects is that this provides a way
to explore the question of what additional assumptions are needed in order to be able to
command these movable objects to move to places that are seen (i.e. are currently on the retina)
as well as to places that were previously seen but must now be recalled from memory. In other
words, one is asking what operations appear to be demanded by the nature of the task being examined.[9]
Since only ANCHORed objects can appear as arguments in the MOVE command, the two
movable objects are assumed to be automatically bound to an ANCHOR. In order to be able to
MOVE these objects to both seen and unseen places, a new operation is required that can cross-
bind FINSTs and ANCHORs. This additional operation, designated BIND(x:FINST,
y:ANCHOR), is what makes it possible not only to coordinate between modalities, but also
allows features that were detected visually (by being FINSTed) to be later referred to by the
motor system – even after they are no longer visible. This is done by first cross-binding the
relevant FINST to an ANCHOR, and then issuing a command to move one of the movable
objects to the location of that (currently invisible) feature. Since one of the objects that can be
moved is the FOVEA, this allows the eye to be moved back to an object that has left the visual field.[10] Inasmuch as the number of both FINSTs and ANCHORs is limited, this process can only
be carried out in a restricted way.
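The binding discipline just described – only ANCHORed objects may serve as MOVE arguments, with BIND(x:FINST, y:ANCHOR) cross-linking the two kinds of index – can be sketched in a few lines. The following is a toy illustration only; the class and function names are mine, not part of the original model:

```python
# Toy sketch (illustrative names, not the author's implementation) of the
# FINST/ANCHOR scheme: FINSTs index visual feature clusters, ANCHORs index
# places in motor-command space, and BIND cross-links the two so that the
# motor system can be directed to a feature that is no longer visible.

class Finst:
    """An index bound to a visual feature cluster; no location code is stored."""
    def __init__(self, feature_id):
        self.feature_id = feature_id

class Anchor:
    """An index in motor-command (and perhaps proprioceptive) space."""
    def __init__(self):
        self.finst = None

def bind(finst, anchor):
    """BIND(x:FINST, y:ANCHOR): cross-bind a visual index to a motor index."""
    anchor.finst = finst
    return anchor

class MotorSystem:
    """Only two movable objects are postulated: the FOVEA and the POINTER."""
    def __init__(self):
        self.position_of = {"FOVEA": None, "POINTER": None}

    def move(self, obj_name, anchor):
        """MOVE accepts only ANCHORed targets; returns the feature reached."""
        if anchor.finst is None:
            raise ValueError("MOVE requires an ANCHOR cross-bound to a FINST")
        self.position_of[obj_name] = anchor
        return anchor.finst.feature_id

# Example: re-fixate a feature that has since left the visual field.
finst = Finst("corner-3")            # assigned while the feature was visible
anchor = bind(finst, Anchor())       # cross-bound before it disappeared
eye = MotorSystem()
reached = eye.move("FOVEA", anchor)  # drive the fovea back to that feature
```

Note that, as in the model, no coordinate code is stored with the FINST itself; the index merely provides access to the feature it points at.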
The only other assumption that is needed to account for the limited ability being explored is
the assumption that certain relational visual predicates, like LEFT-OF or ABOVE, can apply to
sets of features not all of which need be visible, so long as each is bound to either a FINST or to
an ANCHOR. These additional assumptions of the model are summarized in Table 1b.
A summary of the way that FINSTs function in indexing visual features, and providing a
way to cross-reference them to both the evolving internal description and the motor system, is
shown diagrammatically in Figure 1.
———————
9. The original problem that was investigated (described in Pylyshyn, Elcock, Marmor and Sander, 1978a) was to determine a minimal set of assumptions that were necessary in order for a system to be able to draw simple diagrams from a description, and to reason about them in a simple way – e.g. to discover ‘new’ properties that emerged as the diagram was being drawn. Such a simple system was, in fact, implemented in a Planner-like language called POPLER 1.5 and is described in Pylyshyn, Elcock, Marmor, and Sander (1978b). The model only implemented a version of the mechanism which associates features of the diagram with an evolving symbolic description (the diagram was not physically drawn, but merely simulated, so that the actual vision component was not implemented). The primary purpose of this implementation was to examine empirically whether the ideas sketched herein could in fact serve as the basis of a working system: in particular, whether the minimal mechanism was sufficient for drawing and keeping track of figures that were larger than the retina. It would, of course, have been preferable to prove mathematically some results about the limits of a system based on these principles, but it was felt to be premature to undertake such an analysis, given the extremely provisional nature of the assumptions under investigation.
10. The assumption that only ANCHORed/FINSTed objects can be the targets of movements receives some support from single cell recording studies. Goldberg and Wurtz (1972), and Wurtz and Mohler (1976) have shown that at the level of the superior colliculus, the firing rate of cells whose receptive field coincides with the target of an eye movement increases, with the increase occurring well before the eye movement itself begins. This suggests that the activation of such cells may correspond to the binding of FINSTs to ANCHORs at these locations prior to the issuance of a motor command.
==============================
Insert Figure 1 about here
==============================
To summarize: FINSTs allow internal representations to refer to places in a visual scene that
have not yet been assigned unique descriptions. In addition, they allow multiple references to be
made simultaneously, and also allow the motor system to, in effect, issue commands to move a
limb to certain visually perceived locations. The capacity to make such indexical references in
vision has far-reaching implications. A few of the consequences of this primitive mechanism for
explaining various empirical phenomena will be discussed below.
An empirical demonstration of the FINST mechanism:
Tracking multiple independent targets
Perhaps the easiest way to illustrate the FINST hypothesis in a concrete manner is to
describe an experiment intended to be a fairly direct test of several of the basic assumptions
behind this notion.
Consider the following experiment (for more details, see Pylyshyn & Storm, in press).
Suppose subjects are shown a field of identical randomly arranged points and are required to
keep track of some subset of them (called the “targets”) – as they must if their task is to count
the targets, or to indicate when one of them flickers or moves. In such a task, subjects might
proceed by encoding the location of each of the targets with respect to either a local or global
frame of reference, thus making it possible to distinguish and keep track of each target by its
coordinates. The encoding of relative positions might be facilitated by noticing a pattern formed
by the points, thereby “chunking” the set into a single mnemonic pattern. What clearly would not
work in this situation is to remember visual characteristics of the target subset, since the targets
and non-targets are visually identical.
Now suppose the points are set into random independent motion, and the subject is required
to indicate (by pressing a button) whenever one of the target objects briefly changes its shape, or
to indicate (by pressing another button) whenever a non-target briefly changes its shape. In this
case the distinctiveness of each point cannot be attributed to its location, since this is continually
changing. Hence storing a code for the location of each point would not help to solve the
problem, unless the location code is updated sufficiently frequently. The update frequency
would have to be such that during the time between updates the target remained within a small
region where it would not be confused with some nearby non-target. If location codes have to be
assigned in series by moving attention to each in turn (as most people believe), this would entail
sampling and encoding locations according to some sampling schedule in which points are
scanned in sequence. If one had some idea of the maximum rate at which points could be visited
and their locations encoded, it might be possible to design a display sequence that would cause
this strategy to fail – say because the points would have moved far enough during the sample
interval that there was a high probability that another point was now in the place occupied earlier
by the point whose location code one was attempting to update. Under such conditions, subjects
should no longer be able to do the multiple-tracking task described above.
Such an experiment was in fact carried out, and is summarized below. The
following, however, was the conclusion: Using some widely accepted assumptions concerning
the location encoding process, it was found that subjects could do very much better at this task
than predicted by the sequential encoding procedure. What, then, is a possible mechanism for
carrying out this task? If the assumptions and analysis of the experimental situation are correct,
it appears that subjects are able to simultaneously keep track of at least 4, or perhaps even 5 or
more distinct features in the visual field, without encoding their location relative to a global
frame of reference (e.g., without using some explicit symbolic location code). This is precisely
what the FINST hypothesis claims: it says that there is a primitive referencing mechanism for
pointing to certain kinds of features, thereby maintaining their distinctive identity without either
recognizing them (in the sense of categorizing them), or explicitly encoding their locations.
Now consider the details of the experiment. Based on some preliminary studies it was
determined that subjects could track at least 4 randomly-moving points (in the shape of “+”
signs) in a total field of 8 such randomly-moving points, and could detect whether a probe (a
square flashed for 83 msec) occurred on a target, a non-target, or at some other location. In order
to design the task in such a way as to preclude its solution by a sequential-sampling procedure,
appeal was made to the generally held view that in order to encode the location of a point, a
subject must attend to that point. As Anne Treisman and others have shown (e.g. Treisman &
Gelade, 1980), noticing that a stimulus contains a certain feature is not the same as noticing
where that feature is: the two can be functionally dissociated. In order for the information about
location to be available for such purposes as identifying where the point is in relation to some
frame of reference or some other feature, it seems that the feature has to be attended to.
Furthermore, it is widely believed (see the scanning velocity references listed below) that this
sort of attention is unitary – i.e. there is only one attention locus which must be moved from
place to place. Attending, according to this view, entails actually moving a locus of focused
attention (without necessarily moving the eye) to that location. A substantial number of studies
now exist which conclude that a single locus of processing must be moved about in the visual
field and that the movement is continuous (although there are some investigators who disagree
with one or another of the single-locus or the continuous-movement assumptions; see below for
references). Since the FINST hypothesis represents an alternative way in which “attention” may,
in effect, get from one location to another (viz, the system might access a feature through one
index and then access another feature through a second index, and thus not have to scan across
the intervening space), it must be shown that a procedure based on serial scanning could not
account for observed results.
The velocity with which attention appears to move within the visual field has been estimated
by various researchers, using quite different techniques, to range from 30 to 250 degrees per
second (i.e. from 33 to 4 msec/degree). For example, Eriksen & Schultz (1977) provide the
slowest measure of scanning velocity, 30.3 deg/sec; Jolicoeur, Ullman & Mackay (1985) found
contour-following to proceed at 38.5 to 41.7 deg/sec; Shulman, Remington & McLean (1979)
give 52.6 deg/sec for visual scanning; Tsal’s (1983) more direct measurement yields 117.6
deg/sec and Posner, Nissen & Ogden (1978) provide the fastest figure of 250 deg/sec. Many of
these estimates have been questioned; indeed, there has been some criticism of the general
methodology which led many people to conclude that attention must move continuously through
intermediate positions (Remington & Pierce, 1984; Eriksen & Murphy, 1987; Yantis, 1988).
However, if one accepts the widely-held view that attention is unitary and moves continuously,
then 250 degrees/sec (or 4 msec/degree) would certainly appear to be an upper bound on the
speed with which it can move.
Now if the minimum path length required to scan all 4 points being tracked is known, the
dispersion of the points and their velocity can be set so as to ensure that the scan-and-encode
method will frequently mistake a distractor for a target. This was done by a combination of
making the mean speed of movement of the points sufficiently high (8 degrees/second), the mean path length sufficiently long (about 34°), the predictability of the location of a point from its
current velocity and direction sufficiently low (by changing object velocity and direction often),
and the total tracking time sufficiently high (about 4 seconds), and by ensuring that a target is
never more than 1.5 degrees from a distractor.
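The back-of-envelope arithmetic behind this design can be made explicit. The following is my own illustration, using only the parameter values quoted in the text (4 msec/degree scanning cost, a 34° mean scan path, 8 deg/sec target speed, and a 1.5° confusion radius):

```python
# Back-of-envelope arithmetic (my own illustration, using the parameter
# values quoted in the text) showing why serial scan-and-encode should fail.

ms_per_deg       = 4.0    # fastest reported attention velocity: 250 deg/sec
scan_path_deg    = 34.0   # mean shortest path covering all 4 targets
target_speed     = 8.0    # mean target speed, deg/sec
confusion_radius = 1.5    # a distractor may be this close to a target, deg

cycle_ms  = scan_path_deg * ms_per_deg        # one full scan-encode cycle: 136 ms
drift_deg = target_speed * cycle_ms / 1000.0  # drift between visits: ~1.09 deg

# The drift between successive visits to a target is already most of the
# way to the nearest possible distractor (1.09 of 1.5 deg), and frequent
# direction changes make the stored location a poor predictor, so
# nearest-neighbour reacquisition must often grab a distractor instead.
```

Even granting the most generous scanning velocity in the literature, a target moves roughly a degree between successive visits, which is why the serial strategy predicts frequent tracking errors on these displays.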
The determination of the probability of erroneously switching to tracking a distractor prior
to the time of the probe was done by actually simulating a sequential scan of the very stimuli
used in the experiment, and having the simulated process pick the object nearest the encoded
location at each sampling cycle. Time and distance parameters used in the simulation were
based on measurements made on the actual displays used in the experiment. The sequence of
displays was examined and the shortest path covering all 4 targets was measured on each frame
(then averaged over the entire trial). This distance, together with different assumed values of
attention-scanning velocities, was used to obtain the appropriate intersample time. Several
different scanning strategies were simulated, including a complex strategy based on the
assumption that subjects detected the speed and direction of the sampled point and used this to
project, and store, the location at which the point was expected to be when next sampled. An
additional “sophisticated guessing strategy” was also simulated. This assumes that subjects can
reliably detect the occurrence of a probe event even when the event does not occur on an object
being tracked, and also assumes (rather unrealistically, though conservatively) that subjects can
discern when they have lost track of targets. In this case, subjects could guess whenever a
probe occurs on a “lost” trial by randomly selecting one of the three possible responses (i.e.
indicating whether the probe occurred on the target, nontarget, or neither).
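The core of the simulated scan – re-finding each target at every sampling cycle by taking whichever object now lies nearest the location stored on the previous visit – can be sketched as follows. This is a reconstruction for illustration only, not the code used by Pylyshyn & Storm; the function names and the linear-motion example are mine:

```python
import math

def simulate_serial_scan(positions_at, object_ids, target_ids,
                         intersample_s, duration_s):
    """Serial scan-and-encode tracker: each tracked 'slot' is re-found at
    every sampling cycle as whichever object now lies nearest the location
    stored on the previous visit.  Returns the fraction of slots still on
    a genuine target at the end of the trial."""
    followed = {tid: tid for tid in target_ids}   # object each slot believes is its target
    stored = {tid: positions_at(0.0)[tid] for tid in target_ids}
    t = 0.0
    while t < duration_s:
        t += intersample_s
        snapshot = positions_at(t)                # {obj_id: (x, y)} at time t
        for slot in target_ids:
            nearest = min(object_ids,
                          key=lambda oid: math.dist(snapshot[oid], stored[slot]))
            followed[slot] = nearest              # may silently switch to a distractor
            stored[slot] = snapshot[nearest]
    return sum(followed[s] in target_ids for s in target_ids) / len(target_ids)

# One moving target "A" passing near a stationary distractor "B":
def positions_at(t):
    return {"A": (2.0 * t, 0.0), "B": (1.0, 0.8)}

slow = simulate_serial_scan(positions_at, ["A", "B"], ["A"], 1.0, 1.0)   # long intersample
fast = simulate_serial_scan(positions_at, ["A", "B"], ["A"], 0.25, 1.0)  # short intersample
```

With the long intersample interval the slot is captured by the distractor (`slow` is 0.0); with frequent sampling the target is retained (`fast` is 1.0) – which is exactly the dependence on sampling rate that the experimental displays were designed to exploit.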
The predicted performance derived from these simulations, together with the observed
performance, are shown in Figure 2, plotted as a function of the velocity of attention scanning.
Since the upper limit for scanning velocity is taken (again rather conservatively) to be 4
msec/degree, the results show clearly that subjects are not sampling the points in a sequence of
move-encode operations. (The details of the experimental design and the serial scan model used
in the prediction are described in Pylyshyn & Storm, in press).
==============================
Insert Figure 2 about here
==============================
The conclusion, then, is that the 4 targets are being tracked in parallel, and that the tracking
is not based on encoding the locations of points with respect to some frame of reference, but
rather is based on a simple dynamically-maintained indexing scheme such as that proposed by the FINST hypothesis.[11]
———————
11. Since the above experiment was reported, a number of other studies have been carried out using different equipment (a Commodore Amiga computer) that made it possible to achieve smooth movement at faster speeds, to use up to 6 targets whose shapes were varied dynamically, and to construct trajectories that avoided collisions by simulating an inverse-square-law repulsion about each object (instead of making discrete direction changes just prior to a potential collision, as in the present study). It was found that in this more complex setup, experienced subjects were able to perform even better than those in the original experiment, due primarily to the relative ease of tracking smoothly accelerating motion.
Implications for Visual Routines
This section returns to a consideration of the relevance of FINSTs for the computation of
spatial relations. Ullman (1984) has examined a number of spatial properties that the human
visual system can compute with apparent ease, and has asked how this might be done. In the
case of many spatial relations (e.g. “inside”) it is difficult to see how the relation could be
computed by a purely parallel process, without any sequential scanning of the display, since it
requires checking on the relation between the location of a point and an arbitrary curve. All the
possible algorithms that Ullman considers involve some serial process, such as “painting” a
region beginning either at the point in question or at places along the curve, or extending radial
lines from the point in question and noticing the parity of their crossings with the curve. In all
these cases Ullman concludes that “the execution of visual routines requires a capacity to
control locations at which elemental operations are applied” (Ullman, 1984, p. 135). The same
also appears to be true for the detection of a number of other visual properties, such as whether
two points lie on the same contour (e.g., Jolicoeur, 1988).
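For concreteness, one of the serial routines Ullman considers – extending a ray from the point and checking the parity of its crossings with the curve – looks like this for a closed contour. This is an illustrative sketch; representing the contour as a polygon is my assumption:

```python
def inside(point, polygon):
    """Ray-casting parity test: count crossings of a rightward horizontal
    ray from `point` with the polygon's edges; odd parity means 'inside'.
    `polygon` is a list of (x, y) vertices of a closed contour."""
    x, y = point
    crossings = 0
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through `point`?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:       # crossing lies to the right of the point
                crossings += 1
    return crossings % 2 == 1

square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
in_pt = inside((2.0, 2.0), square)   # point inside the contour
out_pt = inside((5.0, 2.0), square)  # point outside the contour
```

The point is not that the visual system runs this algorithm, but that any such routine is inherently serial and must repeatedly direct elementary operations to particular locations on the curve.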
Although it is clear that a capacity to control locations at which processing is carried out is
necessary, it does not follow that this must involve moving a unitary locus of attention to that
location, or encoding the location in some explicit way. Thus while there is a sense in which
Ullman may be correct in claiming that “The marking of a location for later reference requires a
coordinate system…with respect to which the location is defined”, there is no need to assume
that the location has to be defined by an explicit set of coordinate codes.
The FINST mechanism shows how one can mark a location in a manner that will
subsequently allow processing to be directed to it if, for example, some process needs to access
information indexed to that location, or even if a spatially local focus of attention has to be
directed to that location for some reason. FINSTs also provide a way for the location to be
referred to in certain primitive motor commands (since a FINST can be cross-bound to an
ANCHOR). Yet FINSTs do not themselves make an encoding of the location of features
available to high-level processes, such as ones that compute spatial relations between the place in
question and other places. They simply make it possible for appropriate processes to obtain
access to such information; they do this by indexing the feature in question in the actual display.
Encoding spatial information always requires additional processing mechanisms. Most
predicates, and virtually all of Ullman’s proposed visual routines, require that more than one
place be indexed prior to the computation proceeding. Furthermore, many of the studies on
perceptual attention scanning use peripheral cues to induce the movement of attention (e.g.
Posner, Nissen & Ogden, 1978). In these cases the cue itself would have to be located in order to
serve as a directional indicator, and this would have to be done prior to attention being shifted to
it. Thus it is clear that indexing is quite different from “attending” in the usual sense, where this
term is understood to mean a single spatial focus of processing.
Consider the following examples of relations requiring visual routines. Figure 3 shows
some stimuli used to illustrate tasks requiring visual routines (several of these were reported by
Ullman, 1984). In panel (a) the task is to decide whether point x (or x’) is inside the contour. In
panel (b) the task is to say whether points x and y (or x and y’) are on the same contour (as in the
studies reported by Jolicoeur, 1988). In panel (c) the task is to say whether there is a path from
the centre of the circle to the circle itself. In panel (d) the task is to say how many points there
are. In panel (e) the task is to say whether the 3 objects are collinear. Notice that in each case
the task cannot be done without indexing several visual objects. In some cases all the objects in
question are points. In others, such as (a) and (c), they include contours.
==============================
Insert Figure 3 about here
==============================
It is not known whether entire contours can be FINSTed, though there are some reasons for
thinking that at least simple contours, or short smooth segments of larger contours, can. For
example, Rock and Gutman (1981) have demonstrated that people can attend selectively to a
contour of one colour when it is intertwined with a similar contour of a different colour, as
shown by their inability to recognize the unattended contour as one they had seen before. In
other relevant experiments, Treisman & Kahneman (1983), and Kahneman, Treisman & Gibbs
(1983) showed that a letter presented briefly in a particular moving box primes recognition for
that letter with particular effectiveness when the letter recurs in the same box, even when the box
is in a new location. This suggests that subjects can track the movements of contours such as
boxes, and that they also use the identity of an object (such as a box in this case) to index other
associated properties. On the other hand, the fact that the difficulty in evaluating the “inside”
predicate depends to some extent on the size and shape of the bounding contour – at least when
the contour becomes sufficiently complex – suggests that FINSTing entire contours may not be
a simple primitive operation. The present (provisional) assumption is that some larger
aggregates can indeed be FINSTed. However, it may be that an entire contour such as that in
figure 3a requires several FINSTs to cover distinct segments of the curve, and that how
accurately a FINST localizes features depends on such factors as their distinctiveness and on
how many features compete for the pool of available FINSTs (recall that the FINST allocation
process is resource-limited).
In any case it is clear that some pre-attentive indexing must be going on. The assumption
that there are limits on the number of such indexes that can be simultaneously maintained also
seems plausible. The tracking experiment described above suggests that at least 4 (and possibly
as many as 5 or 6) FINSTs are possible – and this number is a lower bound estimate obtained in
a task designed to be particularly difficult. Subitizing (which requires that objects be marked
rapidly as they are counted) suggests about 4 FINSTs in that case (e.g. Klahr, 1973). It seems
likely that the amount of information that can be indexed in this way might be increased by
“chunking” patterns and then FINSTing the entire chunk (see, for example, Mahoney & Ullman,
1988), much as the amount of information held in short-term memory can be increased by
chunking. Clearly there remain many unanswered empirical questions concerning exactly what
kinds and how many features can be FINSTed, though the principle that a number of different
features can be indexed in something like the way assumed by the FINST hypothesis seems well
supported.
Implications for studies of mental imagery
One of the phenomena that led to the development of the FINST hypothesis in the first place was the
widespread assumption that in certain kinds of reasoning people construct and examine a
representation that has many of the properties of a picture (e.g., it has intrinsic metrical and
geometrical properties). This is typically what is meant in referring to a representation as an
“image”. The nature of this representation is assumed to be similar whether constructed from
retinotopic information in perception, or from long-term memory in the course of imagining.
The question of whether the facts of spatial stability of perception require such a representation
has already been raised. In this section, certain evidence will be examined which is frequently
taken to show the existence of such a representation, constructed in a mental workspace in the
course of reasoning.
Among the phenomena that have led some people to assume the existence of a spatially
extended object referred to as an “image” are such findings as the increased time it takes subjects
to report properties of imagined objects when they are instructed to imagine the objects as
‘smaller’, the increased time it takes to mentally scan longer distances in an image, as well as
certain other parallels between imagery and perception (such as motor adaptation to imagined
errors in pointing, which parallels adaptation to observed errors in pointing induced by displacing
prisms; Finke, 1979). Pylyshyn (1981) argues that at least the scanning phenomena, and perhaps
other similar phenomena as well, are due to the demands of the task, and in particular to subjects’
tacit knowledge of what would happen in the real situation being imagined, rather than to any
intrinsic properties of how images are represented. There is one case, however, that does not
appear to be subject to the task-demand criticism; this is where images are “projected” onto some
visual scene (e.g. Finke & Pinker, 1982). In this case the phenomena do not disappear when
instructions are changed appropriately. However, it appears that the scanning results in such
cases can be accounted for by the FINST hypothesis without the need to posit a spatially-
extended internal representation.
Before discussing how the FINST hypothesis can deal with these scanning results, consider
how the FINST idea might be relevant to projected-image tasks in general. A particularly simple
illustration of how the FINST hypothesis can deal with such phenomena is an experiment
described by Shepard and Podgorny (1978). In one version of this experiment (described in
Shepard, 1978), a subject inspects a grid on which a pattern, such as the capital letter “F”, is
outlined. A small spot appears briefly on the display, and the subject must press one of two
buttons; one if the spot occurs in a grid square within the letter, the other if it occurs in one of the
grid squares not inside the letter. Reaction time was found to vary systematically with the
location of the spot on the display: it is generally shorter when the spot is inside the letter and is
shortest when the square on which it occurs lies at the intersection of two or more letter-strokes
(i.e. at an “L” or a “T” vertex of the block letter). What was most interesting, however, is that
exactly the same pattern of results is found when the subject is asked to imagine the letter on the
grid, rather than being shown the actual letter on the screen.
The results of this and other similar studies (e.g. Hayes, 1973) have usually been taken as
evidence that there is a superposition of two stable extended “images”, of the sort discussed
earlier. However, it now appears that in these cases the results can be accounted for quite simply
by appealing to the same mechanism that was used earlier to explain certain phenomena of
geostability, which also do not appear to require an extended internal ‘image’. All one needs to
assume is that in both the perceptual and imaginal conditions, the subject prepares for the task by
placing FINSTs on selected features, or aggregates of perceptually-integral features, such as grid
squares, or even rows or columns of such squares that make up letter strokes. Since this
mechanism allows the subject to index actual places in the display (i.e. particular grid squares) –
whether or not there is actually something graphically distinct about those grid squares – the
task of deciding whether the probe appears on one of these indexed places is carried out visually
in both cases (footnote 12).
Moreover, the systematic pattern of reaction times in both visual and imaginal-visual cases
can be explained in exactly the same way. In neither case is an internal image required, only the
ability to index feature-clusters. For example, consider how the presence of FINSTs might be
used to explain why reaction time to a spot is shorter at a vertex than in the middle of a stroke.
One might develop a stochastic race model in which FINSTs to places that are indexed as figure-
strokes are followed in parallel, and a positive response made when the first such index is found
to lead to a region (assumed to be marked by texture elements or by a grid) which is also indexed
by a probe FINST. If strokes are independently indexed, then there are two paths to a vertex and
only one to the middle of a stroke; hence the time to verify the vertex location would be less than
the mid-stroke location. Whatever the merits of such a speculative model, notice that the same
explanation would hold for the visual case as for the “projected image” case.
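The race-model speculation above can be made concrete with a toy Monte-Carlo simulation. The exponential latency distribution, the parameter values, and all function names here are assumptions made purely for illustration, not part of the FINST model:

```python
import random

random.seed(0)

def verify_time(n_paths, mean_latency=100.0, trials=10000):
    """Estimate verification time when n_paths independently-FINSTed
    strokes race to confirm the probe's location.  Each path's latency
    is drawn from an exponential distribution; the response follows
    whichever path finishes first."""
    total = 0.0
    for _ in range(trials):
        total += min(random.expovariate(1.0 / mean_latency)
                     for _ in range(n_paths))
    return total / trials

mid_stroke = verify_time(n_paths=1)   # one stroke indexes this square
vertex     = verify_time(n_paths=2)   # two strokes converge on a vertex

print(mid_stroke, vertex)  # vertex is roughly half the mid-stroke latency
```

With two racers instead of one, the expected minimum latency drops, reproducing the faster responses at vertices without any internal image.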
The differences between this view and the conventional “image” position are substantial,
and have far-reaching implications for theories of cognitive processing. The standard imagery
view (e.g. as put forward by Kosslyn, Pinker, Smith & Shwartz, 1979, and others) hypothesizes a
representation that is physically realized in some analogue medium with certain intrinsic
Euclidean or metrical properties. This approach assumes that it is these analogue properties of
the medium itself that explain such things as, for example, the increased reaction time that occurs
with increased image-distance scanned (since, according to this view, there is a real physical
analogue to “distance” in the representation of the image – an analogue that obeys the physical
law distance = speed × time).
But note that in the case where imagined places are “projected” onto a visual scene, we do
not need to appeal to particular properties of an internal analogue medium in order to explain
certain psychophysical phenomena, such as those involving “scanning”. We need only appeal to
the relevant properties of the real scene, together with some plausible assumptions concerning
the perceptual process (for example, that perception can veridically encode certain spatial
relationships that hold among FINSTed elements in a scene). The FINST mechanism also makes
it possible to associate conceptual “labels” with such places, and thus could in principle enable
the perceptual-motor system to behave in certain (restricted) respects as though particular kinds
———————
12. The Shepard and Podgorny result can be obtained without using a grid (as in the Hayes study), although the version that uses grids is described above for simplicity of exposition. For present purposes, all that is required is that there be some visual features that can act as reference points for locating places where filled squares and/or target points occur. Surface texture elements are sufficient for this purpose. As has already been remarked, such texture elements are necessary for even the most rudimentary visual stability to occur (footnote 6).
of features actually were located at these places (footnote 13). For example, it could allow attention to be
“scanned” to such indexed places, with or without actual eye movements. If that were the case,
then the increased time taken when the scanned distances are greater would simply be the
consequence of a physical law, since real physical distances, not representations of distances,
were being traversed.
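As a sketch of this point: if attention is shifted between FINSTed places in a real scene at some fixed rate, the latency pattern follows from the geometry of the scene itself. The coordinates, the scan rate, and the function below are all hypothetical:

```python
import math

# Hypothetical FINSTed places in a real scene (screen coordinates, cm).
finsted = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (9.0, 12.0)}

SCAN_SPEED = 10.0  # cm per second; an assumed constant attention-shift rate

def scan_time(src, dst):
    """Time to shift attention between two indexed places.  Because these
    are actual scene locations, the latency follows directly from
    distance = speed * time; no internal analogue medium is required."""
    (x1, y1), (x2, y2) = finsted[src], finsted[dst]
    return math.hypot(x2 - x1, y2 - y1) / SCAN_SPEED

print(scan_time("A", "B"))  # 0.5 s for 5 cm
print(scan_time("A", "C"))  # 1.5 s for 15 cm: triple the distance, triple the time
```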
Forming visual descriptions
FINSTs may also play an important role in the process of encoding and recognizing a scene.
There is considerable evidence now that the encoding of visual information in memory is a
process of forming internal descriptions (see, for example, the discussion of this point in
Pylyshyn, 1973). There has been a great deal of exploratory research in both Artificial
Intelligence and in Cognitive Science on the nature of the basic component shapes (i.e., on the
vocabulary of the descriptions). Such primitive shape elements play an important role in an
approach to object recognition called ‘recognition by parts’ (e.g. Pentland, 1987; Hoffman &
Richards, 1985; Biederman, 1988). In this approach, an unknown object is described in terms of
the shape-class of its constituent parts, the transformations of these canonical shapes (e.g.
rotated, tapered, elongated, etc.), and the relations among them. As the description is being built,
its parts are looked up in memory for purposes of recognition. This process can be very complex
because in general the description is a hierarchical one, in which patterns of basic parts
themselves form higher level patterns.
Despite considerable progress in identifying basic shapes (Pentland, 1986), the encoding of
the hierarchy of relations among these basic shapes is much more difficult. The problem is that
in encoding this hierarchy one has to keep track of a great many things. First one has to keep
track of the pattern that forms each of the basic parts, in order to identify them. Then one has to
keep track of the next level of patterns among these patterns, and so on. Yet the human visual
system appears to be able to encode complex scenes and to identify them in about a tenth of a
second (e.g., Potter, 1975). The complexity of the representation that is built up can be
appreciated by considering the enormous difficulty people have in building a representation of
even a simple figure from verbal instructions. What appears to be so hard about the latter task is
the problem of retaining substructures while building the next levels of a hierarchy. It seems
———————
13. Of course, it remains an open empirical question just how perception-like the processing of information from such “bound” features can be. There have been claims that some illusions – such as the Muller-Lyer illusion – can be created by imagining arrowheads on lines (Bernbaum & Chung, 1981). Although the interpretation of such experiments is not unproblematic, they should not, in any case, be taken as supporting an “imagery” position: for one thing, there is much we don’t know about the locus of such illusions in the visual case.
plausible that one of the things that makes the task easier when the input is visual is that retention
of the substructures is aided by the continued presence of the corresponding part of the scene in
the visual field. This observation suggests a possible role for FINSTs in the encoding of
complex shapes and in the “recognition by parts” process.
FINSTs allow a system to simultaneously refer to several features of an existing pattern as
well as to aggregates (or chunks) of features, and also allow these features to be linked to
symbols in long-term memory. Such indexing and cross-binding of parts of a scene to symbol
structures can occur at many levels of a hierarchical description. This suggests that a description
could be built up level by level, by a process which keeps the working memory load down by
relying on the FINSTing of patterns in the display. An example of how such a process might
work is the following (the details here are highly speculative; the intent is simply to illustrate
how FINSTs might play a role in the process).
When a novel scene is initially presented, FINSTs are assigned to some initial set of
features. A cluster of such indexed features might then be recognized and chunked (perhaps
along the lines suggested by Mahoney and Ullman, 1988). Such a cluster would then be treated
as a single item: its description would be stored in LTM and its token occurrence in the scene
assigned a FINST, which would also be linked to the LTM description. This would free up the
FINSTs that had been bound to its subpart features, and would also allow the chunk as a whole
to be referred to (say in new relational predicates). This new reference capability is an important
step in structure-building in general, and follows a principle that Marr (1982) referred to as “the
principle of explicit naming”.
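The speculative cycle just described (bind, recognize, store, release, re-bind) might be sketched as follows. The class, variable, and token names are illustrative only, and the fixed pool size is an assumption of this sketch:

```python
import itertools

class FinstPool:
    """Toy pool of a small, fixed number of visual indexes (FINSTs)."""
    def __init__(self, size=5):
        self.free = size
        self.bindings = {}                # finst id -> scene token
        self._ids = itertools.count()     # fresh ids, never reused

    def bind(self, token):
        if self.free == 0:
            raise RuntimeError("no FINSTs available")
        self.free -= 1
        fid = next(self._ids)
        self.bindings[fid] = token
        return fid

    def release(self, fids):
        for fid in fids:
            del self.bindings[fid]
            self.free += 1

ltm = {}                                  # long-term store of descriptions
pool = FinstPool(size=5)

# Level 1: index four primitive feature tokens in the scene.
fids = [pool.bind(tok) for tok in ["bar1", "bar2", "bar3", "bar4"]]

# The cluster is recognized and chunked: its description goes into LTM,
# the subpart FINSTs are freed, and a single FINST is assigned to the
# chunk's token occurrence in the scene, linked to the stored description.
ltm["square"] = {"parts": ["bar1", "bar2", "bar3", "bar4"]}
pool.release(fids)
chunk_fid = pool.bind("square-token")
ltm["square"]["token"] = chunk_fid

print(pool.free)  # 4: the whole chunk now occupies one FINST instead of four
```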
What has just been described is an example of the familiar hierarchical chunking process,
frequently postulated in theories of learning and memory (e.g. Johnson, 1972). What is special
about the present proposal is the idea that FINSTs provide the placeholders that allow such
descriptions to be built level by level, with chunks at the current level being bound to token parts
of the scene, and their descriptions stored in long-term memory. Thus at any particular time the
system, in effect, has access to a hybrid entity consisting partly of a symbol structure and partly
of indexed objects in the scene. For example, in encoding the relation between two complex
subfigures which are resting on top of one another, the arguments of the ON-TOP-OF(x,y)
relation might be FINSTs bound to feature aggregates or chunks in the scene, thus obviating the
need to have the description of the substructure simultaneously present in working memory.
This hierarchical chunking process, with each successive level of the hierarchy being built by
reference to the visual display rather than to descriptions held entirely in working memory, uses
much less working memory. It does, however, assume some way to index parts of a figure and
to link them to structures in long-term memory. This is precisely what FINSTs are intended to
provide.
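The hybrid symbol-plus-index entity described above might be caricatured as follows. The coordinate representation of scene tokens, and all names here, are assumptions made for illustration, not claims of the model:

```python
# A relational predicate whose arguments are FINST indexes bound to scene
# chunks: working memory holds only the symbolic fact plus two indexes,
# not the chunks' full descriptions.

scene = {  # token chunks at their actual positions in the scene (x, y)
    "figureA": (4.0, 7.0),
    "figureB": (4.0, 2.0),
}

finst = {1: "figureA", 2: "figureB"}  # FINST index -> scene token

def on_top_of(fx, fy):
    """ON-TOP-OF(x, y): evaluated by consulting the indexed scene tokens
    themselves rather than descriptions held in working memory."""
    (_, ya), (_, yb) = scene[finst[fx]], scene[finst[fy]]
    return ya > yb

working_memory = [("ON-TOP-OF", 1, 2)]   # the hybrid entity
rel, x, y = working_memory[0]
print(on_top_of(x, y))  # True: figureA rests above figureB
```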
Summary and Conclusion
This paper has presented a number of examples illustrating the usefulness of assuming a
primitive mechanism capable of individuating and dynamically indexing a small number of
features (or feature-clusters) in a visual field. Such an assumption can help illuminate a number
of quite disparate empirical phenomena. It was argued that something very much like the FINST
binding mechanism is independently required for determining where visual operations (such as
those in “visual routines”) are to be applied. FINSTs represent the primary mechanism by which
variables in visual predicates and operations can be bound to particular places or elements in a
stimulus so that they can be evaluated with respect to particular feature-locations in a scene.
In addition to exploring these assumptions – and suggesting a number of others, such as
those involving cross-modality binding of visual and motor spaces – this paper has presented
some direct evidence bearing on one of the assumptions about properties of FINSTs. The
assumption in question is that FINSTs can pre-attentively track a number of independently
moving visually-identical objects under conditions where it is unlikely that the task is being done
by serial time-sharing.
The wide range of phenomena addressed by this simple, independently motivated postulate
makes it a promising basis for investigating the interface at which attention and higher cognitive
processes are brought to bear on the products of the earliest automatic and preattentive stages of
vision and of visual-motor coordination. Moreover, although this point is beyond the scope of
the present paper, there is also a need for a mechanism such as the FINST to deal with the
problem of assigning semantics to linguistic expressions containing spatial indexicals (like
“here” and “there”) – a problem that has occupied many people interested in semantics and its
relation to perception (see, for example, Peacocke, 1983).
References
Avant, L.L. (1965). Vision in the Ganzfeld. Psychological Bulletin, 64, 246-258.
Ballard, D.H. (1986). Cortical connections and parallel processing: Structure and function. The
Behavioral and Brain Sciences, 9, 67-120.
Biederman, I. (1988). Aspects and extensions of a theory of human image processing. In Z.W.
Pylyshyn (ed). Computational Processes in Human Vision: Interdisciplinary Perspectives.
Norwood, N.J.: Ablex Publishing.
Bernbaum, K., and Chung, C. S. (1981). Muller-Lyer Illusion Induced by Imagination, Journal
of Mental Imagery 5:125-128.
Burkell, J.A. and Pylyshyn, Z.W. (1988). Is colour change a primitive visual feature? Cognitive
Science Technical Report 34. Centre for Cognitive Science, University of Western Ontario,
London, Canada.
C o l e s , M . G . , G r a t t o n , G . , B a s h o r e , T . R . , E r i k s e n , C . W . & D o n c h i n , E . ( 1 9 8 5 ) . A
p s y c h o p h y s i c a l i n v e s t i g a t i o n o f t h e c o n t i n u o u s fl o w m o d e l o f h u m a n i n f o r m a t i o n
processing. Journal of Experimental Psychology: Human Perception and Performance, 11,
529-553.
Eriksen, C.W. and Schultz, D.W. (1977). Retinal locus and acuity in visual
information processing. Bulletin of the Psychonomic Society, 9:81-84.
Eriksen, C. W. and St. James, J. D. (1986). Visual attention within and around the field of focal
attention: a zoom lens model. Perception and Psychophysics, 40, 225-240.
Estes, W.K., Allmeyer, D.H. & Reder, S.M. (1976). Serial position functions for letter
identification at brief and extended exposure durations. Perception and Psychophysics, 19,
1-15.
Feldman, J.A. (1985). Four frames suffice: A provisional model of vision and space. The
Behavioral and Brain Sciences, 8, 265-313.
Feldman, J.A. & Ballard, D.H. (1982). Connectionist models and their properties. Cognitive
Science, 6, 205-254.
Finke, R.A. (1979). The functional equivalence of mental images and errors of movement.
Cognitive Psychology, 11, 235-264.
Finke, R. A. and Pinker, S. (1982). Spontaneous imagery scanning in mental extrapolation.
Journal of Experimental Psychology: Learning, Memory and Cognition, 2, 142-147.
Fodor, J.A. and Pylyshyn, Z.W. (1981). How direct is visual perception: Some reflections on
Gibson’s ‘ecological approach’. Cognition, 9, 139-196.
Goldberg M. and Wurtz, R. (1972). Activity of superior colliculus in behaving monkeys. I:
Visual Receptive Fields in Single Neurons, Journal of Neurophysiology, 35, 542-559.
Goodale, M.A., Pelisson, D. & Prablanc, C. (1986). Large adjustments in visually guided
reaching do not depend on vision of the hand or perception of target displacement. Nature, 320,
748-750.
Hayes, J.R. (1973). On the function of visual imagery in elementary mathematics. In W. Chase
(ed) Visual Information Processing. New York: Academic Press.
Hochberg, J. (1968). In the Mind’s Eye. In R.N. Haber (ed) Contemporary Theory and Research
in Visual Perception. New York: Holt, Rinehart & Winston.
H o f f m a n , J .E. and Nelson, B. (1981). Spatial selectivity in visual search. Percepti o n a n d
Psychophysics, 30, 283-290.
Hoffman, D. and Richards, W. (1985). Parts of Recognition. In A. Pentland (ed), From Pixels
to Predicates. Norwood, N.J.: Ablex Publishing.
Johnson, N.F. (1972). Organization and the concept of a memory code. In A.W. Melton & E.
Martin (Eds), Coding Processes in Human Memory. New York: Winston.
Jolicoeur, P. (1988). Curve tracing operations and the perception of spatial relations. In Z.W.
Pylyshyn (ed). Computational Processes in Human Vision: Interdisciplinary Perspectives.
Norwood, N.J.: Ablex Publishing, in press.
Kahneman, D., Treisman, A., and Gibbs, B. (1983). Moving objects and spatial attention.
Presented at the 20th Annual Meeting of the Psychonomic Society, San Diego,
California.
Klahr, D. (1973). Quantification processes, In W. Chase (ed) Visual Information Processing.
New York: Academic Press.
Koch, C. and Ullman, S. (1984). Selecting one among the many: a simple network
implementing shifts in selective visual attention. A.I. Memo 770. Cambridge, MA: MIT AI
Lab.
Kosslyn, S.M., Ball, T.M., and Reiser, B. J. (1978). Visual Images Preserve Metrical Spatial
Information: Evidence from Studies of Image Scanning. Journal of Experimental
Psychology: Human Perception and Performance, 4:46-60.
Kosslyn, S. M., S. Pinker, G. Smith, and S. P. Shwartz. (1979). On the Demystification of Mental
Imagery, The Behavioral and Brain Sciences, 2:535-548.
Laberge, D. (1983). Spatial extent of attention to letters and words. Journal of Experimental
Psychology: Human Perception and Performance 9, 371-379.
Marr, D. (1982). Vision. San Francisco: W.H. Freeman.
Marr, D., and Nishihara, H.K. (1976). Representation and Recognition of Spatial Organization
of Three-Dimensional Shapes, MIT A.I. Memo 377:1-57.
Mahoney, J.V. and Ullman, S. (1988). Image chunking: defining spatial building blocks for scene
analysis. In Z.W. Pylyshyn (ed), Computational Processes in Human Vision:
Interdisciplinary Perspectives. Norwood, N.J.: Ablex.
Miles, F.A. & Kawano, K. (1987). Visual stabilization of the eyes. Trends in Neurosciences, 10,
153-158.
Mishkin, M., Ungerleider, L.G. and Macko, K.A. (1983). Object vision and spatial vision: two
cortical pathways. Trends in Neuroscience, 6, 414-417.
Eriksen, C.W., and Murphy, T.D. (1987). Movement of attentional focus across the visual field:
A critical look at the evidence. Perception and Psychophysics, 42, 299-305.
Newell, A. (1973). Production Systems: Models of Control Structures, in Visual Information
Processing, ed. W. Chase. New York: Academic Press.
Newell, A. (1980). Harpy, production systems and human cognition. In R. Cole (Ed.),
Perception and Production of Fluent Speech, Hillsdale, N.J.: Erlbaum.
Peacocke, C. (1983). Sense and Content. Oxford: Clarendon Press.
Pentland, A. (1987). Recognition by Parts. Proc. ICCV 87, London, June 1987.
Pentland, A. (1986). Perceptual Organization and the Representation of Natural Form. Artificial
Intelligence Journal, 28, 1-38.
Posner, M.I., Nissen, M.J., and Ogden, W.C. (1978). Attended and unattended processing modes:
The role of set for spatial location. In H.L. Pick, and I.J. Saltzman (eds), Modes of
Perceiving and Processing Information, Hillsdale, New Jersey: Lawrence Erlbaum.
Potter, M. (1975). Meaning and visual search. Science, 187, 965-966.
Pylyshyn, Z.W. (1984). Computation and Cognition: Toward a Foundation for Cognitive
Science. Cambridge, Mass.: MIT Press, a Bradford Book.
Pylyshyn, Z.W. (1981). The Imagery Debate: Analogue Media versus Tacit Knowledge,
Psychological Review 88:16-45.
Pylyshyn, Z.W. (1973). What the Mind’s Eye Tells the Mind’s Brain: A Critique of Mental
Imagery. Psychological Bulletin, 80, 1-24.
Pylyshyn, Z.W., Elcock, E.W., Marmor, M., and Sander, P. (1978a). Explorations in Visual-
Motor Spaces, Proceedings of the Second International Conference of the Canadian Society
for Computational Studies of Intelligence, University of Toronto.
Pylyshyn, Z.W., Elcock, E.W., Marmor, M., and Sander, P. (1978b). A system for perceptual-
motor based reasoning. Technical Report #42. Department of Computer Science, University
of Western Ontario, London, Ontario, Canada.
Pylyshyn, Z.W. and Storm, R.W. (1989). Tracking of Multiple Independent Targets: Evidence
for a Parallel Tracking Mechanism. Spatial Vision, in press.
Remington, R. & Pierce, L. (1984). Moving attention: Evidence for time-invariant shifts of
visual selective attention. Perception and Psychophysics, 35, 393-399.
Rock, I. (1981). Anorthoscopic Perception, Scientific American 244:145-153.
Rock, I., and Ebenholtz, S. (1962). Stroboscopic Movement Based on Change of Phenomenal
rather than retinal location, American Journal of Psychology, 75:193-207.
Rock I., and Gutman, D. (1981). The Effect of Inattention on Form Perception, Journal of
Experimental Psychology: Human Perception and Performance 7:275-285.
Shepard, R.N. (1978). The mental image. American Psychologist, 33, 125-137.
Shepard, R.N., and Podgorny, P. (1978). Cognitive Processes that Resemble Perceptual
Processes, in W.K. Estes (Ed.), Handbook of Learning and Cognitive Processes (Vol. 5),
Hillsdale, N.J.: Erlbaum.
Shulman, G.L., Remington, R.W., and McLean, J.P. (1979). Moving Attention through Visual
Space, Journal of Experimental Psychology: Human Perception and Performance
5:522-526.
Steinbach, M.J. (1988). Muscles as sense organs. Archives of Ophthalmology. xx-xx.
Stevens, J.K. (1978). The corollary discharge: Is it a sense of position or a sense of space? The
Behavioral and Brain Sciences, 1, 163-164.
Treisman, A. and Gelade, G. (1980). A feature integration theory of attention. Cognitive
Psychology, 12, 97-136.
Treisman A. and Kahneman D. (1984). Accumulation of Information within object files.
Presented at the 24th Annual Meeting of the Psychonomic Society. San Diego, California.
Tsal, Y. (1983). Movements of Attention across the Visual Field, Journal of Experimental
Psychology: Human Perception and Performance 9:523-530.
Turvey, M.T. (1977). Contrasting Orientations to the Theory of Visual Information Processing,
Psychological Review 84:67-88.
Ullman, S. (1984). Visual Routines, Cognition 18:97-159.
Wu, J.J. and Caelli, T.M. (in press). On locating objects and recovering their motions: A
predictive method for computational prehension. In M. Goodale (ed), Vision and Action:
The Control of Grasping. Norwood, N.J.: Ablex.
Wright, R.D., Dawson, M.R. and Pylyshyn, Z.W. (1987). Spatio-temporal parameters and the
three-dimensionality of apparent motion: Evidence for two types of processing. Spatial
Vision, 2, 263-272.
Wurtz, R.W., and Mohler, C.W. (1976). Organization of Monkey Superior Colliculus: Enhanced
Visual Response of Superficial Layer Cells, J. Neurophysiol. 39:745-765.
Yantis, S. (1988). On analog movements of visual attention. Perception and Psychophysics, in
press.
Table 1a: Summary of Some Assumptions of the
FINST Model
1. Primitive retinotopic processes produce feature-clusters automatically and in parallel across the retina.
2. Certain of these clusters are selected or activated (also in parallel) based on their distinctiveness within a local neighbourhood (e.g. the so-called “popout” or odd-man-out features). These tend to be feature clusters that are reliably associated with distinct distal or scene features.
3. The activated clusters compete for a finite pool of internal referencing tokens called FINSTs. This also happens in parallel, and the initial assignment of FINSTs is stimulus-driven. Since the supply of FINSTs is limited, this is a resource-constrained process.
4. The primitive processes that create feature clusters also maintain their integrity: A FINST that is bound to a feature cluster keeps being bound to it as the cluster changes its location continuously on the retina. In this way FINSTs “point to” fixed places in a scene without identifying what is being pointed to – serving like the indexical pronouns “here” or “there”.
5. Some higher order patterns, consisting of aggregates of primitive feature clusters (e.g., contours), can also be assigned FINSTs under either top-down or bottom-up control.
6. Only FINSTed feature clusters can enter into subsequent processing: i.e., relational properties like INSIDE(x,y), PART-OF(x,y), ABOVE(x,y), COLLINEAR(x,y,z), … can only be encoded if features x, y, z, … are FINSTed.
7. There can be some top-down influence in the selection of which activated clusters receive FINSTs. Higher level processes can, for example, direct a FINST to be placed on certain already activated features, defined in terms of other FINSTed features (e.g. INTERSECTION(u:line, v:line), where the two lines are already FINSTed).
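Assumptions 1 through 3 can be caricatured as a stimulus-driven competition for a finite pool of indexes. The distinctiveness scores, the scoring rule, and the pool size below are placeholders for illustration, not claims of the model:

```python
N_FINSTS = 4  # an assumed size for the finite pool of indexes

clusters = [  # (cluster id, distinctiveness within its local neighbourhood)
    ("c1", 0.9), ("c2", 0.2), ("c3", 0.7), ("c4", 0.4),
    ("c5", 0.8), ("c6", 0.1), ("c7", 0.6),
]

# Stimulus-driven competition: the most distinctive ("popout") clusters
# win the available FINSTs; the rest remain unindexed.
winners = sorted(clusters, key=lambda c: c[1], reverse=True)[:N_FINSTS]
finsts = {i: cid for i, (cid, _) in enumerate(winners)}

print(finsts)  # {0: 'c1', 1: 'c5', 2: 'c3', 3: 'c7'}
```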
Table 1b: Further assumptions involving the motor system
8. Corresponding to the FINST index in vision, which allows objects to be referred to in visual predicates, there is an index called an ANCHOR that allows objects to be referred to in motor commands. Only objects bound to ANCHORs can appear as arguments to the MOVE(x,y) command (currently the only motor command assumed).
9. Only two moveable objects are assumed in the minimal version of the model: the FOVEA and the POINTER. These are assumed to always be ANCHORed, so that the pointer can be moved into the FOVEA and the FOVEA can be moved to the location of the pointer. Hence MOVE(FOVEA, POINTER) and MOVE(POINTER, FOVEA) are assumed to be primitive operations.
10. There is a primitive operation, called BIND, for cross-binding an element bound to a FINST to one bound to an ANCHOR, thus allowing a cross-reference between indexes in different modalities, the first step towards visual-motor coordination. This allows a MOVE command to be issued to the location of a feature that was once on the retina, even after the feature is no longer visible. The system can command either of the moveable objects to move to the location of the ANCHOR by using the primitive operation BIND(x:FINST, y:ANCHOR), followed by either MOVE(FOVEA, y:ANCHOR) or MOVE(POINTER, y:ANCHOR).
11. Objects that are bound to an ANCHOR (which always includes the POINTER and the FOVEA) can serve in place of FINSTed features when evaluating perceptual predicates (such as ABOVE(x,y), INSIDE(x,y), and so on) even if they are not on the retina. In other words these two objects provide a limited means for evaluating spatial relations among pairs of places when both places are not visible concurrently.
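A minimal sketch of how BIND and MOVE might interact under assumptions 8 through 10. Representing places as coordinates, and all function and variable names here, are assumptions of this illustration rather than details of the model:

```python
locations = {"FOVEA": (0, 0), "POINTER": (5, 5)}    # the two moveable objects
anchors = {"FOVEA": "FOVEA", "POINTER": "POINTER"}  # these are always ANCHORed
finsts = {"f1": (9, 3)}                             # a currently indexed feature

def bind(finst_id, anchor_name):
    """BIND(x:FINST, y:ANCHOR): cross-bind an indexed visual feature to an
    anchor so motor commands can target its place later."""
    anchors[anchor_name] = finst_id
    locations[anchor_name] = finsts[finst_id]

def move(obj, anchor_name):
    """MOVE(x, y:ANCHOR): bring a moveable object to an anchored place."""
    locations[obj] = locations[anchor_name]

bind("f1", "A1")             # cross-bind the feature to anchor A1
del finsts["f1"]             # the feature leaves the retina...
move("POINTER", "A1")        # ...yet the pointer can still be moved to it
print(locations["POINTER"])  # (9, 3)
```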