Texture perception
Ruth Rosenholtz
Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, CSAIL, Cambridge,
MA, USA
To appear in:
Oxford Handbook of Perceptual Organization
Oxford University Press
Edited by Johan Wagemans
Abstract
Texture informs our interpretation of the visual world. It provides a cue to the shape and orientation of
a surface, to segmenting an image into meaningful regions, and to classifying those regions, e.g. in terms
of material properties. This chapter discusses recent advances in understanding of segmentation and
representation of visual texture. Successful models have described texture by a rich set of image
statistics (“stuff”) rather than by the features of discrete, pre-segmented texture elements (“things”).
Texture processing mechanisms may also underlie important phenomena in peripheral vision known as
crowding. If true, such mechanisms would influence the information available for object recognition,
scene perception, and many visual-cognitive tasks.
Keywords: texture, texture segmentation, representation, crowding, image statistics, stuff vs. things
1. Introduction: What is texture?
The structure of a surface, say of a rock, leads to a pattern of bumps and dips which we can feel with our
fingers. This applies equally well to the surface of skin, the paint on the wall, the surface of a carrot, or
the bark of a tree. Similarly, the pattern of blades of grass in a lawn, pebbles on the ground, or fibers in
woven material, all lead to a tactile “texture”. The surface variations that lead to texture we can feel
also tend to lead to variations in the intensity of light reaching our eyes, producing what is known as
“visual texture” (or here, simply “texture”). Visual texture can also come from variations that do not
lend themselves to tactile texture, such as the variation in composition of a rock (quartz looks different
from mica), waves in water, or patterns of surface color such as paint.
Texture is useful for a variety of tasks. It provides a cue to the shape and orientation of a surface (Gibson
1950). It aids in identifying the material of which an object or surface is made (Gibson 1986). Most
obviously relevant for this Handbook, texture similarity provides one cue to perceiving coherent groups
and regions in an image.
Understanding human texture processing requires the ability to synthesize textures with desired
properties. By and large this was intractable before the wide availability of computers. Gibson (1950)
studied shape-from-texture by photographing wallpaper from different angles. Our understanding of
texture perception would be quite limited if we were restricted to the small set of textures found in
wallpaper. Attneave (1954) gained significant insight into visual representation by thinking about
perception of a random noise texture, though he had to generate that texture by hand, filling in each
cell according to a table of random numbers. Beck (1966; 1967) formed micropattern textures out of
black tape affixed to white cardboard, restricting the micropatterns to those made of line segments.
Olson and Attneave (1970) had more flexibility, as their micropatterns were drawn in india ink. Julesz
(1962, 1965) was in the enviable position of having access to computers and algorithms for generating
random textures. More recently, texture synthesis techniques have gotten far more powerful, allowing
us to gain new insights into human vision.
It is elucidating to ask why we label the surface variations of tree bark “texture,” and the surface
variations of the eyes, nose, and mouth “parts” of a face object, or objects in their own right. One
reason for the distinction may be that textures have different identity-preserving transformations than
objects. Shifting around regions within a texture does not fundamentally change most textures, whereas
swapping the nose and mouth on a face turns it into a new object (see also Behrmann et al., this
volume). Two pieces of the same tree bark will not look exactly the same, but will seem to be the same
“stuff”, and therefore swapping regions has minimal effect on our perception of the texture. Textures
are relatively homogeneous, in a statistical sense, or at least slowly varying. Fundamentally, texture is
statistical in nature, and one could argue that texture is stuff that is more compactly represented by its
statistics – its aggregate properties – than by the configuration of its parts (Rosenholtz 1999).
That texture and objects have different identity-preserving transformations suggests that one might
want to perform different processing on objects than on texture. In the late 1990s, that was certainly
the case in computer vision and image processing. Object recognition algorithms differed greatly from
texture classification algorithms. Algorithms for determining object shape and pose were very different
from those that found the shape of textured surfaces. In image coding, regions containing texture might
be compressed differently than those dominated by objects (Popat and Picard 1993). The notion of
different processing for textures vs. objects was prevalent enough that several researchers developed
algorithms to find regions of texture in an image, though this was hardly a popular idea (Karu et al. 1996;
Rosenholtz 1999).
However, exciting recent work (Section 4) suggests that human vision employs texture processing
mechanisms even when performing object recognition tasks in image regions not containing obvious
“texture”. The phenomena of visual crowding provided the initial evidence for this hypothesis. If true,
such mechanisms would influence the information available for object recognition, scene
perception, and diverse tasks in visual cognition.
This chapter reviews texture segmentation, texture classification/appearance, and visual crowding. It is
obviously impossible to fully cover such a diversity of topics in a short chapter. The material covered will
focus on computational issues, on the representation of texture by the visual system, and on
connections between the different topics.
2. Texture segmentation
2.1 Phenomena
An important facet of vision is the ability to perform “perceptual organization,” in which the visual
system quickly and seemingly effortlessly transforms individual feature estimates into perception of
coherent regions, structures, and objects. One cue to perceptual organization is texture similarity. The
visual system uses this cue in addition to and in conjunction with (Giora and Casco 2007; Machilsen and
Wagemans 2011) grouping by proximity, feature similarity, and good continuation (see also Brooks, this
volume; Elder, this volume).
The dual of grouping by similar texture is important in its own right, and has, in fact, received more
attention (see also Dakin, this volume). In “preattentive” or “effortless” texture segmentation two
texture regions quickly and easily segregate – in less than 200 milliseconds. Observers may perceive a
boundary between the two. Figure 1 shows several examples. Like contour integration and perception
of illusory contours, texture segmentation is a classic Gestalt phenomenon. The whole is different than
the sum of its parts (see also Wagemans, this volume), and we perceive region boundaries which are not
literally present in the image (Figure 1abc).
Figure 1. Texture segmentation pairs. (a)-(d): Micropattern textures. (a) Easily segments,
and the two textures have different 2nd order pixel statistics; (b) Also segments fairly
easily, yet the textures have the same 2nd order statistics; (c) Different 2nd-order
statistics, does not easily segment, yet it is easy to tell apart the two textures; (d)
Neither segments nor is it easy to tell apart the textures. (e,f) Pairs of natural textures.
The pair in (f) is easier to segment, but all 4 textures are clearly different in appearance.
Researchers have taken performance under rapid presentation, often followed by a mask, as meaning
that texture segmentation is preattentive and occurs in early vision (Julesz 1981; Treisman 1985).
However, the evidence for both claims is somewhat questionable. We do not really understand in what
way rapid presentation limits visual processing. Can higher-level processing not continue once the
stimulus is removed? Does fast presentation mean preattentive? (See also Gillebert & Humphreys, this
volume.) Empirical results have given conflicting answers. Mack et al. (1992) showed that texture
segmentation was impaired under conditions of inattention due to the unexpected appearance of a
segmentation display during another task. However, the segmentation boundaries in their stimuli
aligned almost completely with the stimulus for the main task: two lines making up a large “+” sign. This
may have made the segmentation task more difficult. Perhaps judging whether a texture edge occurs at
the same location as an actual line requires attention. In a dual-task paradigm, by contrast, Mack et al.
(1992) demonstrated good performance at texture segmentation. Others (Braun and Sagi 1991; Ben-Av and
Sagi 1995) show similar results for a singleton-detection task they refer to as texture segregation.
Certainly performance with rapid presentation would seem to preclude mechanisms which require serial
processing of the individual micropatterns which make up textures like those in Figure 1a-d.
Some pairs of textures segment easily (Figure 1ab), others with more difficulty (Figure 1c). Some texture
pairs are obviously different, even if they do not lead to a clearly perceived segmentation boundary
(Figure 1c), whereas other texture pairs require a great deal of inspection to tell the difference (Figure
1d). Predicting the difficulty of segmenting any given pair of textures provides an important benchmark
for understanding texture segmentation. Researchers have hoped that such understanding would
provide insight more generally into early vision mechanisms, such as what features are available
preattentively.
2.2 Statistics of pixels
When two textures differ sufficiently in their mean luminance, segmentation occurs (Boring 1945; Julesz
1962). The same seems true for other differences in the luminance histogram (Julesz 1962; Julesz 1965;
Chubb et al. 2007). In other words, a sufficiently large difference between two textures in their 1st-order
luminance statistics leads to effortless segmentation.1 Differences in 1st-order chrominance statistics
also support segmentation (e.g. Julesz 1965).
However, differences in 1st-order pixel statistics are not necessary for texture segmentation to occur.
Differences in line orientation between two textures are as effective as differences in brightness (Beck
1966; Beck 1967; Olson and Attneave 1970). Consider micropattern textures formed of line segments
1 Terminology in the field of texture perception stands in a confused state. “1st- and 2nd-order” can refer to
(a) 1st-order histograms of features vs. 2nd-order correlations of those features; (b) statistics involving a
measurement to the first power (e.g. the mean) vs. a measurement to the power of 2 (e.g. the variance), i.e. the
1st and 2nd moments from mathematics; or (c) a model with only one filtering stage, vs. a model with a filtering
stage, a non-linearity, and then a 2nd filtering stage. This chapter uses the first definition.
(e.g. Figures 1a-c). Differences in the orientations of the line segments predict segmentation better than
either the orientation of the micropatterns, or their rated similarity. An array of upright Ts segments
poorly from an array rotated by 90 degrees; the line orientations are the same in the two patterns. A T
appears more similar to a tilted (45˚) T than to an L, but Ts segment from tilted-Ts more readily than
they do from Ls.
Julesz (1965) generated textures defined by Markov processes, in which each pixel depends
probabilistically on its predecessors. He observed that one could often see within these textures clusters
of similar brightness values. For example, such clusters might form horizontal stripes, or dark triangles.
Julesz suggested that early perceptual grouping mechanisms might extract these clusters, and that, “As
long as the brightness value, the spatial extent, the orientation and the density of clusters are kept
similar in two patterns, they will be perceived as one”.
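The following is a minimal sketch of how a Markov process can produce visible clusters, not a reconstruction of Julesz's actual procedure. It assumes a binary texture in which each pixel repeats its left neighbour with probability p_stay and otherwise takes the opposite value. The function names and the parameter p_stay are hypothetical, introduced only for illustration.

```python
import random

def markov_row(width, p_stay, rng):
    """One row of a binary texture: each pixel repeats its left neighbour
    with probability p_stay, otherwise it takes the opposite value."""
    row = [rng.randint(0, 1)]
    for _ in range(width - 1):
        prev = row[-1]
        row.append(prev if rng.random() < p_stay else 1 - prev)
    return row

def markov_texture(height, width, p_stay, seed=0):
    """A height x width binary texture built from independent Markov rows."""
    rng = random.Random(seed)
    return [markov_row(width, p_stay, rng) for _ in range(height)]

def mean_run_length(texture):
    """Average length of horizontal runs of identical pixels."""
    runs = 0
    pixels = 0
    for row in texture:
        pixels += len(row)
        # A new run starts at the first pixel and at every change of value.
        runs += 1 + sum(1 for a, b in zip(row, row[1:]) if a != b)
    return pixels / runs

# A high p_stay yields horizontal streaks; p_stay = 0.5 yields unstructured noise.
streaky = markov_texture(64, 64, p_stay=0.9)
noisy = markov_texture(64, 64, p_stay=0.5)
```

Under these assumptions the streaky texture contains much longer runs of equal brightness than the noisy one, although no cluster is ever specified explicitly. The clusters follow entirely from the transition probabilities.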
It is tempting to observe clusters in Julesz’ examples and conclude that extraction of “texture elements”
(a.k.a. texels), underlies texture perception. However, texture perception might also be mediated by
measurement of image statistics, with no intermediate step of identifying clusters. The stripes and
clusters in Julesz’ examples were, after all, produced by random processes. As Julesz (1975) put it, “[10
years ago,] I was skeptical of statistical considerations in texture discrimination because I did not see
how clusters of similar adjacent dots, which are basic for texture perception, could be controlled and
analyzed by known statistical methods… In the intervening decade much work went into finding
statistical methods that would influence cluster formation in desirable ways. The investigation led to
some mathematical insights and to the generation of some interesting textures.”
The key, for Julesz, was to figure out how to generate textures with desired clusters of dark and light
dots, while controlling their image statistics. With the help of collaborators Gilbert, Shepp, and Frisch
(acknowledged in Julesz 1975), Julesz proposed simple algorithms for generating pairs of micropattern
textures with the same 1st- and 2nd-order pixel statistics. For Julesz’ black and white textures, 1st-order
statistics reduce to the fraction of black dots making up the texture. 2nd-order or dipole statistics can be
measured by dropping “needles” onto a texture, and observing the frequency with which both ends of
the needle land on a black dot, as a function of needle length and orientation. Such 2nd-order statistics
are equivalent to the power spectrum.
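The needle-dropping measurement can be sketched as follows, assuming a binary texture stored as a list of rows with 1 denoting a black dot. For simplicity the sketch tries every needle position for a given integer displacement (dx, dy) rather than dropping needles at random, and it wraps around at the image edges. Both choices are assumptions of this illustration, as are the function names.

```python
def first_order(texture):
    """1st-order statistic: the fraction of black (1) pixels."""
    total = sum(len(row) for row in texture)
    return sum(map(sum, texture)) / total

def dipole(texture, dx, dy):
    """2nd-order (dipole) statistic: the fraction of needle placements
    for which both ends land on black. The needle is the displacement
    (dx, dy); positions wrap around at the edges."""
    h, w = len(texture), len(texture[0])
    hits = 0
    for y in range(h):
        for x in range(w):
            if texture[y][x] and texture[(y + dy) % h][(x + dx) % w]:
                hits += 1
    return hits / (h * w)

# Two small textures with identical 1st-order statistics (half black).
stripes = [[1, 1, 1, 1],
           [0, 0, 0, 0],
           [1, 1, 1, 1],
           [0, 0, 0, 0]]
checks = [[1, 0, 1, 0],
          [0, 1, 0, 1],
          [1, 0, 1, 0],
          [0, 1, 0, 1]]
```

Both toy textures are half black, yet a horizontal needle of length one gives 0.5 for the stripes and 0 for the checkerboard, while a diagonal needle gives the reverse. The two textures therefore share 1st-order statistics but differ in 2nd-order statistics.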
Examination of texture pairs sharing 1st- and 2nd-order pixel statistics led to the now-famous “Julesz
conjecture”: “Whereas textures that differ in their first- and second-order statistics can be discriminated
from each other, those that differ in their third- or higher-order statistics usually cannot” (Julesz 1975).
This theory predicted a number of results, for both random noise and micropattern-based textures. For
instance, the textures in Figure 1a differ in their 2nd-order statistics, and readily segment, whereas the
textures in Figure 1d share 2nd-order statistics, and do not easily segment.
2.3 Statistics of textons
However, researchers soon found counterexamples to the Julesz conjecture (Caelli and Julesz 1978;
Caelli et al. 1978; Julesz et al. 1978; Victor and Brodie 1978). For example, the Δ→ (triangle vs. arrow) texture pair (Figure 1b)
is relatively easy to segment, yet the two textures have the same 2nd-order statistics. A difference in 2nd-
order pixel statistics appeared neither necessary nor sufficient for texture segmentation.
Based on the importance of line orientation in texture segmentation (Beck 1966; Beck 1967; Olson and
Attneave 1970), two new classes of theories emerged. The first suggested that texture segmentation
was mediated not by 2nd-order pixel statistics, but rather by 1st-order statistics of basic stimulus features
such as orientation and size (Beck, Prazdny, and Rosenfeld, 1983). Here “1st-order” refers to histograms
of, e.g., orientation, instead of pixel values.
But what of the Δ→ texture pair? By construction, it contained no difference in the 1st-order statistics of
line orientation. Notably, however, triangles are closed shapes, whereas arrows are not. Perhaps
emergent features (Pomerantz & Cragin, this volume), like closure, also matter in texture segmentation.
Other iso-2nd order pairs hinted at the relevance of additional higher-level features, dubbed textons.
Texton theory proposes that segmentation depends upon 1st-order statistics not only of basic features
like orientation, but also of textons such as curvature, line endpoints, and junctions (Julesz 1981; Bergen
and Julesz 1983).
While intuitive on the surface, this explanation was somewhat unsatisfying. Proponents were vague
about the set of textons, making the theory difficult to test or falsify. In addition, it was not obvious how
to extract textons, particularly for natural images (Figure 1ef). (Though see Barth et al. (1998), for both a
principled definition of a class of textons, and a way to measure them in arbitrary images.) Texton
theories have typically been based on verbal descriptions of image features rather than actual
measurements (Bergen and Adelson 1988). These “word models” effectively operate on “things” like
“closure” and “arrow junctions” which a human experimenter has labeled (Adelson 2001).
2.4 Image processing-based models
By contrast, another class of “image computable” theories emerged. These models are based on simple
image processing operations (Knutsson and Granlund 1983; Caelli 1985; Turner 1986; Bergen and
Adelson 1988; Sutter et al. 1989; Fogel and Sagi 1989; Bovik et al. 1990; Malik and Perona 1990; Bergen
and Landy 1991; Rosenholtz 2000). According to these theories, texture segmentation arises as an
outcome of mechanisms like those known to exist in early vision.
These models have similar structure: a first linear filtering stage, followed by a non-linear operator,
additional filtering, and a decision stage. They have been termed filter-rectify-filter (e.g. Dakin et al.
1999), or linear-nonlinear-linear (LNL, Landy and Graham 2004) models. Chubb and Landy (1991)
dubbed the basic structure the “back-pocket model”, as it was the model many researchers would “pull
out of their back pocket” to explain segmentation phenomena.
The first stage typically involves multiscale filters, both oriented and unoriented. The stage-two non-
linearity might be a simple squaring, rectification, or energy computation (Knutsson and Granlund 1983;
Turner 1986; Sutter et al. 1989; Bergen and Adelson 1988; Fogel and Sagi 1989; Bovik et al. 1990),
contrast normalization (Bergen and Landy 1991; Rosenholtz 2000), or inhibition and excitation between
neighboring channels and locations (Caelli 1985; Malik and Perona 1990). The final filtering and decision
stages often act as a coarse-scale edge detector. Much effort has gone into uncovering the details of the
filters and nonlinearities.
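The filter-rectify-filter structure can be illustrated with a one-dimensional sketch. It does not implement any of the cited models. It assumes a single small band-pass kernel for the first stage, squaring for the non-linearity, a box average for the second filter, and a decision rule that picks the location of the largest coarse-scale change in pooled energy. The kernel, the pooling width pool, and the function names are all hypothetical choices.

```python
def convolve(signal, kernel):
    """Same-size filtering with edge replication (kernel is symmetric here)."""
    half = len(kernel) // 2
    last = len(signal) - 1
    return [sum(w * signal[min(max(i + k - half, 0), last)]
                for k, w in enumerate(kernel))
            for i in range(len(signal))]

def lnl_boundary(signal, pool=9):
    """Filter, rectify, filter, then decide where the texture changes."""
    # Stage 1: linear filter (a small centre-surround-like kernel).
    filtered = convolve(signal, [-0.5, 1.0, -0.5])
    # Stage 2: pointwise non-linearity (squaring, i.e. local energy).
    energy = [v * v for v in filtered]
    # Stage 3: second, coarser linear filter (local averaging).
    pooled = convolve(energy, [1.0 / pool] * pool)
    # Decision: position of the largest coarse-scale change in pooled energy.
    step = pool // 2
    last = len(pooled) - 1
    change = [abs(pooled[min(i + step, last)] - pooled[max(i - step, 0)])
              for i in range(len(pooled))]
    boundary = max(range(len(change)), key=change.__getitem__)
    return boundary, pooled

# Two 1D "textures" with the same mean luminance but different scale.
fine = [i % 2 for i in range(32)]            # 0, 1, 0, 1, ...
coarse = [(i // 2) % 2 for i in range(32)]   # 0, 0, 1, 1, ...
boundary, pooled = lnl_boundary(fine + coarse)
```

The two halves have the same mean luminance, so a purely linear filter followed by averaging could not separate them. The squaring stage converts the difference in fine-scale structure into a difference in mean energy, which the second filter and the decision stage then localize near the junction of the two halves.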
As LNL models employ oriented filters, they naturally predict segmentation of textures that differ in
their component orientations. But what about results thought to require more complex texton
operators? Bergen and Adelson (1988) examined segmentation of an XL texture pair like that in Figure
1a. These textures contain the same distribution of line orientations, and Bergen and Julesz (1983) had
suggested that easy segmentation might be mediated by such features as terminators and X- vs. L-
junctions. Bergen and Adelson (1988) demonstrated the feasibility of a simpler solution, based on low-
level mechanisms. They observed that the Xs appear smaller than the Ls, even though their component
lines are the same length. Beck (1967) similarly observed that Xs and Ls have a different overall
distribution of brightness when viewed out of focus. Bergen and Adelson demonstrated that if one
accentuates the difference in size, by increasing the length of the Ls’ bars (while compensating the bar
intensities so as not to make one texture brighter than the other), segmentation gets easier. Decrease
the length of the Ls’ bars, and segmentation becomes quite difficult. Furthermore, they showed that in
the original stimulus, a simple size-tuned mechanism – center-surround filtering followed by full-wave
rectification – responds more strongly to one texture than the other. Even though our visual systems can
ultimately identify nameable features like terminators and junctions, those features may not underlie
texture segmentation, which may involve lower-level mechanisms.
The LNL models naturally lend themselves to implementation. Nearly all the models cited here (Section
2.4) were implemented at least up to the decision stage. They operate on arbitrary images.
Implementation makes these models testable and falsifiable, in stark contrast to word models operating
on labeled “things” like micropatterns and their features. Furthermore, the LNL models have performed
reasonably well. Malik and Perona's (1990) model, one of the most fully specified and successful, made
testable predictions of segmentation difficulty for a number of pairs of micropattern textures. They
found strong agreement between their model’s predictions and behavioral results of Kröse (1986) and
Gurnsey and Browse (1987). They also produced meaningful results on a complex piece of abstract art.
Image computable models naturally make testable predictions about the effects of texture density
(Rubenstein and Sagi 1996), alignment, and sign of contrast (Graham et al. 1992; Beck et al. 1987), for
which word models inherently have trouble making predictions.
2.5 Bringing together statistical and image processing-based models
Is texture segmentation, then, a mere artifact of early visual processing, rather than a meaningful
indicator of statistical differences between textures? The visual system should identify boundaries in an
intelligent way, not leave their detection to the caprices of early vision. Making intelligent decisions in
the face of uncertainty is the realm of statistics. Furthermore, statistical models seem appropriate due
to the statistical nature of textures.
Statistical and image processing-based theories are not mutually exclusive. Arguably the first filtering
stage in LNL models extracts basic features, and the later filtering stage computes a sort of average.
Perhaps thinking in terms of intelligent decisions can clarify the role of unknown parameters in the LNL
models, better specify the decision process, and lend intuitions about which textures segment.
If the mean orientations of two textures differ, should we necessarily perceive a boundary? From a
decision-theory point of view this would be unwise; a small difference in mean might occur by chance.
Perhaps textures segment if their 1st-order feature statistics are significantly different (Voorhees and
Poggio 1988; Puzicha et al. 1997; Rosenholtz 2000). Significant difference takes into account the
variability of the textures; two homogeneous textures with mean orientations differing by 30 degrees
may segment, while two heterogeneous textures with the same difference in mean may not.
Experimental results confirm that texture segmentation shows this dependence upon texture variability.
Observers can also segment two textures differing significantly in the variance of their orientations.
However, observers are poor at segmenting two textures with the same mean and variance, when one is
unimodal and the other bimodal (Rosenholtz 2000). It seems that observers do not use the full 1st-order
statistics of orientation.
These results point to the following model of texture segmentation (Rosenholtz 2000). The observer
collects n noisy feature estimates from each side of a hypothesized edge. The number of samples is
limited, as texture segmentation involves local rather than global statistics (Nothdurft 1991). If the two
sets of samples differ significantly, with some confidence, α, then the observer sees a boundary.
Rosenholtz (2000) tests for a significant difference in mean orientation, mean contrast, orientation
variance, and contrast variance.
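The decision rule can be sketched as a two-sample test on feature estimates taken from either side of a candidate edge. The sketch below uses Welch's t statistic as a stand-in for the significance test. It treats orientation as an ordinary linear variable, omits internal noise and the pooling region, and uses a hypothetical threshold named criterion in place of the confidence level α. None of these choices is taken from Rosenholtz (2000).

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic for a difference in means."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error

def sees_boundary(left, right, criterion=2.0):
    """Report a boundary when the mean difference is large relative to
    the variability of the samples (criterion is a hypothetical threshold)."""
    return abs(welch_t(left, right)) > criterion

# Orientations in degrees. Both pairs differ by 30 degrees in the mean.
homogeneous_left = [-4.0, -2.0, 0.0, 2.0, 4.0]
homogeneous_right = [26.0, 28.0, 30.0, 32.0, 34.0]
heterogeneous_left = [-60.0, -30.0, 0.0, 30.0, 60.0]
heterogeneous_right = [-30.0, 0.0, 30.0, 60.0, 90.0]
```

With these samples the homogeneous pair gives a t statistic of 15 and the heterogeneous pair a t statistic of 1, so only the first would be reported as a boundary under the assumed criterion. The difference in mean orientation is 30 degrees in both cases.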
The model can be implemented using biologically plausible image processing operations. Though the
theoretical development came from thinking about statistical tests on discrete samples, the model
extracts no “things” like line elements or texels. Rather it operates on continuous “stuff” (Adelson 2001).
The model has three fairly intuitive free parameters, all of which can be determined by fitting behavioral
data. Two internal noise parameters capture human contrast and orientation discriminability. The last
parameter specifies the radius of the region over which measurements are pooled to compute the
necessary summary statistics (mean, variance, etc.).
Human performance segmenting orientation-defined textures is well fit by the model (Rosenholtz 2000).
The model also predicts the rank ordering of segmentation strength for micropattern texture pairs (TL,
+T, Δ→, and L+) found by Gurnsey and Browse (1987). Furthermore, Hindi Attar et al. (2007) related the
salience of a texture boundary to the rate of filling-in of the central texture in stabilized images. They
found that the model predicted many of the asymmetries found in filling-in.
The visual system may do something intelligent, like a statistical test (Voorhees and Poggio 1988;
Puzicha et al. 1997; Rosenholtz 2000), or Bayesian inference (Lee 1995; Feldman, this volume, Chapter
54 on Bayesian models), when detecting texture boundaries within an image. These decisions can be
implemented using biologically plausible image processing operations, thus bringing together image
processing-based and statistical models of texture segmentation.
3. Texture perception more broadly
Decisions based upon a few summary statistics do a surprisingly good job of predicting existing texture
segmentation phenomena. Are these few statistics all that is required for texture perception more
broadly? This seems unlikely. First, they perhaps do not even suffice to explain texture segmentation.
Simple contrast energy has probably worked in place of more complex features only because we have
tested a very limited set of textures (Barth et al. 1998).
Second, consider Figure 1a-d. The mean and variance of contrast and orientation do little to capture the
appearance of the component texels, yet we have a rich percept of their shapes and arrangement. What
measurements, then, might human vision use to represent textures?
Much of the early work in texture classification and discrimination came from computer vision. It aimed
at distinguishing between textured regions in satellite imagery, microscopy, and medical imagery. As
with texture segmentation, early research pinpointed 2nd-order statistics, particularly the power
spectrum, as a possible representation (Bajcsy 1973). Researchers also explored Markov Random Field
representations more broadly. For practical applications, power spectrum and related measures worked
reasonably well. (For a review, see Haralick 1979, and Wechsler 1980.)
However, the power spectrum cannot predict texture segmentation, and texture appearance likely
requires more information rather than less. Furthermore, texture classification provides a weak test.
Performance is highly dependent upon both the diversity of textures in the dataset and the choice of
texture categories. A texture analysis/synthesis method better enables us to get a sense of the
information encoded by a given representation (Tomita et al. 1982; Portilla and Simoncelli 2000).
Texture analysis/synthesis techniques measure a descriptor for a texture, and then generate new
samples of texture which share the same descriptor. Rather than simply synthesizing a texture with
given properties, they can measure those properties from an arbitrary input texture. The “analysis”
stage makes the techniques applicable to a far broader array of textures. Most of the progress in
developing models of human texture representation has been made using texture analysis/synthesis
strategies.
One can easily get a sense of the information encoded by the power spectrum by generating a new
image with the same Fourier transform magnitude, but random phase. This representation is clearly
inadequate to capture the appearance (Figure 2). The synthesized texture in Figure 2b looks like filtered
noise (because it is), rather than like the peas in Figure 2a. The synthesized texture has none of the
edges, contours, or other locally oriented structures of a natural image. Natural images are highly
non-Gaussian (Zetzsche et al. 1993). The responses of oriented bandpass filters applied to natural scenes are
kurtotic (sparse) and highly dependent; these statistics cannot be captured by the power spectrum
alone, and are responsible for important aspects of the appearance of natural images (Simoncelli and
Olshausen 2001).
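The phase-randomization procedure can be sketched in one dimension, treating a single scan line in place of a full image. The sketch uses a direct O(n²) discrete Fourier transform written with the standard library so that it is self-contained. A practical implementation would use a two-dimensional FFT. The function names are hypothetical.

```python
import cmath
import math
import random

def dft(x):
    """Direct discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Direct inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]

def randomize_phase(signal, seed=0):
    """Keep the Fourier magnitudes of a real signal but scramble the phases."""
    rng = random.Random(seed)
    n = len(signal)
    spectrum = dft(signal)
    scrambled = list(spectrum)  # DC (and Nyquist, if present) stay untouched
    for k in range(1, (n + 1) // 2):
        phase = rng.uniform(-math.pi, math.pi)
        scrambled[k] = abs(spectrum[k]) * cmath.exp(1j * phase)
        # Conjugate symmetry keeps the inverse transform real-valued.
        scrambled[n - k] = scrambled[k].conjugate()
    return [v.real for v in idft(scrambled)]

# A scan line containing one sharp-edged bar.
pulse = [0.0] * 6 + [1.0] * 4 + [0.0] * 6
scrambled_pulse = randomize_phase(pulse)
```

The output has the same Fourier magnitudes as the input, and hence the same power spectrum and mean, but the bar's sharp edges are generally lost. Edges depend on phase alignment across frequencies, which the power spectrum does not record.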
Figure 2. Comparison of the information encoded in different texture descriptors. (a)
Original peas image; (b) Texture synthesized to have the same power spectrum as (a),
but random phase. This representation cannot capture the structures visible in many
natural and artificial textures, though it performs adequately for some textures such as
the left side of Figure 1e. (c) Marginal statistics of multiscale, oriented and non-oriented
filter banks better capture the nature of edges in natural images (Heeger and Bergen
1995); (d) Joint statistics work even better at capturing structure (Portilla and Simoncelli
2000).
Due to limitations of the power spectrum and related measures, researchers feared that statistical
descriptors could not adequately capture the appearance of textures formed of discrete elements, or
containing complex structures (Tomita et al. 1982). Some researchers abandoned purely statistical
descriptors in favor of more “structural” approaches, which described texture in terms of discrete texels
and their placement rule (Tomita et al. 1982; Zucker 1976; Haralick 1979). Implicitly, structural
approaches assume that texture processing occurs at later stages of vision, “a cognitive rather than a
perceptual approach” (Wechsler 1980). Some researchers suggested choosing between statistical and
structural approaches, depending upon the kind of texture (Zucker 1976; Haralick 1979).
Structural models were less than successful, largely due to difficulty extracting texels. This worked
better when texels were allowed to consist of arbitrary image regions, rather than correspond to
recognizable “things” (e.g. Leung and Malik 1996).
The parallels to texture segmentation should be obvious: researchers rightly skeptical about the power
of simple statistical models abandoned them in favor of models operating on discrete “things”. As with
texture segmentation, the lack of faith in statistical models proved unfounded. Sufficiently rich statistical
models can capture a lot of structure. Demonstrating this requires more complex texture synthesis
methodologies to find samples of texture with the same statistics. A number of texture synthesis
techniques have been developed, with a range of proposed descriptors.
Heeger and Bergen's (1995) descriptor, motivated by the success of the LNL segmentation models,
consists of marginal (i.e. 1st-order) statistics of the outputs of multiscale filters, both oriented and
unoriented. Their algorithm synthesizes new samples of texture by beginning with an arbitrary image
“seed” – often a sample of random noise, though this is not required – and iteratively applying
constraints derived from the measured statistics. After a number of iterations, the result is a new image
with (approximately) the same 1st-order statistics as the original. Figure 2c shows an example. Their
descriptor captures significantly more structure than the power spectrum; enough to reproduce the
general size of the peas and their dimples. It still does not quite get the edges right, and misrepresents
larger-scale structures.
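One way to impose a marginal (1st-order) constraint of this kind is histogram matching, sketched below for two equal-length lists of values. The sketch shows only this single step. It assumes the step would be applied to the pixel values and to each filter output in turn, with the whole cycle repeated. The multiscale filter bank and the iteration loop are omitted, and the function name is hypothetical.

```python
def match_histogram(source, target):
    """Re-value `source` so that its sorted values equal those of `target`,
    while preserving the rank order of `source`.
    Both sequences are assumed to have the same length in this sketch."""
    if len(source) != len(target):
        raise ValueError("equal lengths assumed in this sketch")
    # Indices of source, ordered from its smallest value to its largest.
    order = sorted(range(len(source)), key=source.__getitem__)
    matched = [0.0] * len(source)
    # The i-th smallest source position receives the i-th smallest target value.
    for value, index in zip(sorted(target), order):
        matched[index] = value
    return matched

noise_seed = [0.3, 0.9, 0.1, 0.5]
texture_values = [10, 10, 20, 40]
synthesized = match_histogram(noise_seed, texture_values)
```

The result contains exactly the values of the target, so its histogram matches, but they are arranged according to the rank order of the noise seed. Spatial arrangement is therefore constrained only indirectly, through the filter outputs to which such a step is applied.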
Portilla and Simoncelli (2000) extended the Heeger/Bergen methodology, and included in their texture
descriptor the joint (2nd-order) statistics of responses of multiscale V1-like simple and complex “cells”.
Figure 2d shows an example synthesis. This representation captures much of the perceived structure,
even in micropattern textures (Portilla and Simoncelli 2000; Balas 2006), though it is not perfect. Some
non-parametric synthesis techniques have performed better at producing new textures which look like
the original (e.g. Efros and Leung 1999). However, these techniques use a texture descriptor which is
essentially the entire original image. It is unclear how biologically plausible such a representation might
be, or what the success of such techniques teaches us about human texture perception.
Portilla and Simoncelli (2000), then, remains a state-of-the-art parametric texture model. This does not
imply that its measurements are literally those made by the visual system, though they are certainly
biologically plausible. A “rotation” of the texture space would maintain the same information while
changing the representation dramatically. Furthermore, a sufficiently rich set of 1st-order statistics can
encode the same information as higher-order statistics (Zhu et al 1996). However, the success of Portilla
and Simoncelli’s model demonstrates that a rich and high-dimensional set of image statistics comes
close to capturing the information preserved and lost in visual representation of a texture.
4. Texture perception is not just for textures
Researchers have long studied texture perception in the hope that it would lend insight into vision more
generally. Texture segmentation, rather than merely informing us about perceptual organization, might
uncover the basic features available preattentively (Treisman 1985), or the nature of early nonlinearities
in visual processing (Malik and Perona 1990; Graham et al. 1992; Landy and Graham 2004). However,
common wisdom assumed that after the measurement of basic features, texture and object perception
mechanisms diverged (Cant and Goodale 2007). Similarly, work in computer vision assumed separate
processing for texture vs. objects.
More recent work blurs the distinction between texture and object processing. Modern computer vision
treats them much more similarly. Recent human vision research demonstrates that “texture processing”
operations underlie vision more generally. The field’s previous successes in understanding texture
perception may elucidate visual processing for a broad array of tasks.
4.1 Peripheral crowding
Texture processing mechanisms have been associated with visual search (Treisman 1985) and set
perception (Chong and Treisman 2003). One can argue that texture statistics naturally inform these
tasks. Evidence of more general texture processing in vision has come from the study of peripheral
vision, in particular visual crowding.
Peripheral vision is substantially worse than foveal vision. For instance, the eye trades off sparse
sampling over a wide area in the periphery for sharp, high resolution vision over a narrow fovea. If we
need finer detail, we move our eyes to bring the fovea to the desired location.
The phenomenon of visual crowding² illustrates that loss of information in the periphery is not merely
due to reduced acuity. A target such as the letter ‘A’ is easily identified when presented in the periphery
on its own, but becomes difficult to recognize when flanked too closely by other stimuli, as in the string
of letters, ‘BOARD’. An observer might see these crowded letters in the wrong order, perhaps confusing
the word with ‘BORAD’. They might not see an ‘A’ at all, or might see strange letter-like shapes made up
of a mixture of parts from several letters (Lettvin 1976).
2 “Crowding” is used inconsistently and confusingly in the field, sometimes as a transitive verb (“the flankers
crowd the target”), sometimes as a mechanism, and sometimes as the experimental outcome in which recognizing
a target is impaired in the presence of nearby flankers. This chapter predominantly follows the last definition,
though in describing stimuli it sometimes refers to the lay sense of “a lot of stuff in a small space.”
Crowding occurs with a broad range of stimuli (see Pelli and Tillman 2008 for a review). However, not all
flankers are equal. When the target and flankers are dissimilar or less grouped together, target
recognition is easier (Andriessen and Bouma 1976; Kooi et al 1994; Saarela et al. 2009). Strong grouping
among the flankers can also make recognition easier (Livne and Sagi 2007; Sayim et al 2010; Manassi et
al. 2012). Furthermore, crowding need not involve discrete “target” and “flankers”; Martelli et al. (2005)
argue that “self-crowding” occurs in peripheral perception of complex objects and scenes.
4.2 Texture processing in peripheral vision?
The percept of a crowded letter array contains sharp, letter-like forms, yet they seem lost in a jumble, as
if each letter’s features (e.g., vertical bars and rounded curves) have come untethered and been
incorrectly bound to the features of neighboring letters (Pelli et al. 2004). Researchers have associated
the phenomena of crowding with the “distorted vision” of strabismic amblyopia (Hess 1982). Lettvin
(1976) observed that an isolated letter in the periphery seems to have characteristics which the same
letter, flanked, does not. The crowded letter “only seems to have a ‘statistical’ existence.” In line with
these subjective impressions, researchers have proposed that crowding phenomena result from “forced
texture processing,” involving excessive feature integration (Pelli et al. 2004), or compulsory averaging
(Parkes et al. 2001) over each local pooling region. Pooling region size grows linearly with eccentricity,
i.e. with distance to the point of fixation (Bouma 1970).
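
The linear growth of pooling regions can be written as a one-line rule. The sketch below is illustrative only: the function name is hypothetical, and the default proportionality constant of 0.5 is an assumption (a value commonly quoted in connection with Bouma 1970), not a figure stated in this chapter.

```python
def pooling_region_diameter(eccentricity_deg, scale=0.5):
    """Approximate pooling-region extent, in degrees, at a given eccentricity.

    `scale` is an assumed proportionality constant; only the linear form
    of the relationship is taken from the text.
    """
    if eccentricity_deg < 0:
        raise ValueError("eccentricity must be non-negative")
    return scale * eccentricity_deg


# Doubling the distance from fixation doubles the region over which features pool.
near = pooling_region_diameter(10.0)
far = pooling_region_diameter(20.0)
```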
Assume for the sake of argument – following Occam’s razor – that the peripheral mechanisms
underlying crowding operate all the time, by default; no mechanism perversely “switches on” to thwart
our recognition of flanked objects. This Default Processing assumption has profound implications for
vision. Peripheral vision is hugely important; very little processing truly occurs in the fovea. One can
easily recognize the cat in Figure 3, when fixating on the “+”. Yet the cat may extend a number of
degrees beyond the fovea. Could object recognition, perceptual organization, scene recognition, face
recognition, navigation, and guidance of eye movements all share an early, local texture processing
mechanism? Is it that “texture is primitive and textures combine to produce forms” (Lettvin 1976)? This
seems antithetical to ideas of different processing for textures and objects. Prior to 2000, it would have
seemed surprising to use a texture-like representation for more general visual tasks.
Figure 3. Original images (a,c) and images synthesized to have approximately the same
local summary statistics (b,d). Intended (and model) fixation on the “+”. The cat can
clearly be recognized while fixating, even though much of the object falls outside the
fovea. The summary statistics contain sufficient information to capture much of its
appearance (b). Similarly, the summary statistics contain sufficient information to
recognize the gist of the scene (d), though perhaps not to correctly assess its details. (e)
A patch of search display, containing a tilted target and vertical distractors. (f) The
summary statistics (here, in a single pooling region) are sufficient to decipher the
approximate number of items, much about their appearance, and the presence of the
target. (g) A target-absent patch from search for a white vertical among black vertical
and white horizontal. (h) The summary statistics are ambiguous about the presence of a
white vertical, perhaps leading to perception of illusory conjunctions. (c-h) originally
appeared in (Rosenholtz, Huang, et al. 2012).
However, several state-of-the-art computer vision techniques operate upon local texture-like image
descriptors, even when performing object and scene recognition. The image descriptors include local
histograms of gradient directions, and local mean response to oriented multi-scale filters, among others
(Bosch et al 2006, 2007; Dalal and Triggs, 2005; Oliva and Torralba 2006; Tola et al. 2010; Fei-Fei and
Perona 2005). Such texture descriptors have proven effective for detection of humans in natural
environments (Dalal and Triggs, 2005), object recognition in natural scenes (Bosch et al, 2007; Mutch
and Lowe, 2008; Zhu, Bichot, and Chen, 2011), scene classification (Oliva and Torralba 2001; Renninger
and Malik 2004; Fei-Fei and Perona 2005), wide-baseline stereo (Tola et al. 2010), gender discrimination
(Wang, Yau, and Sung, 2010), and face recognition (Velardo and Dugelay, 2010). These results represent
only a handful of hundreds of recent computer vision papers utilizing similar methods.
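
As a concrete illustration of one such descriptor, the following sketch computes a local histogram of gradient directions for a small image patch. It is a minimal example under stated assumptions, not the method of any of the cited papers: the patch is a plain list of rows, gradients are central differences, orientation is taken modulo 180 degrees, and the function name and bin count are hypothetical.

```python
import math


def orientation_histogram(patch, n_bins=8):
    """Magnitude-weighted histogram of gradient orientations (0-180 deg) in a patch."""
    rows, cols = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the horizontal and vertical gradient.
            gx = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            gy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            bin_index = int(angle / (180.0 / n_bins)) % n_bins
            hist[bin_index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# A luminance ramp along x: every gradient points horizontally.
ramp_x = [[float(c) for c in range(6)] for _ in range(6)]
# The same ramp along y: every gradient points vertically.
ramp_y = [[float(r) for _ in range(6)] for r in range(6)]
hist_x = orientation_histogram(ramp_x)
hist_y = orientation_histogram(ramp_y)
```

The descriptor discards where in the patch each gradient occurred and keeps only how much gradient energy falls at each orientation, which is what makes it texture-like.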
Suppose we take literally the idea that peripheral vision involves early local texture processing. The key
questions are whether on the one hand, humans make the sorts of errors one would expect, and on the
other hand whether texture processing preserves enough information to explain the successes of vision,
such as object and scene recognition.
A local texture representation predicts vision would be locally ambiguous in terms of the phase and
location of features, as texture statistics contain such ambiguities. Do we see evidence of this in vision? In fact,
we do. Observers have difficulty distinguishing 180 degree phase differences in compound sine wave
gratings in the periphery (Bennett and Banks 1991; Rentschler and Treutwein 1985) and show marked
position uncertainty in a bisection task (Levi and Klein 1986). Furthermore, such ambiguities appear to
exist during object and scene processing, though we rarely have the opportunity to be aware of them.
Peripheral vision tolerates considerable image variation without giving us much sense that something is
wrong (Freeman and Simoncelli 2011; Koenderink et al. 2012). Koenderink et al. (2012) apply a spatial
warping to an ordinary image. It is surprisingly difficult to tell that anything is wrong, unless one fixates
near the image. (See http://i-perception.perceptionweb.com/fulltext/i03/i0490sas.)
To go beyond qualitative evidence, we need a concrete proposal for what “texture processing” means.
This chapter has reviewed much of the relevant work. Texture appearance models aim to understand
texture processing in general, whereas segmentation models attempt only to predict grouping. Our
current best guess as to a model of texture appearance is that of Portilla and Simoncelli (2000). Perhaps
the visual system computes something like 2nd-order statistics of the responses of V1-like cells, over
each local pooling region. We call this the Texture Tiling Model. This proposal (Balas et al. 2009;
Freeman and Simoncelli 2011) is not so different from standard object recognition models, in which
later stages compute more complex features by measuring co-occurrences of features from the previous
layer (Fukushima 1980; Riesenhuber and Poggio 1999). Second-order correlations are essentially co-
occurrences pooled over a substantially larger area.
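
To make "2nd-order statistics over a pooling region" concrete, the sketch below pools a handful of response channels into means and pairwise correlations. It is a toy under stated assumptions, not the Portilla and Simoncelli statistic set or the Texture Tiling Model itself: each channel is assumed to be a flat list of filter responses sampled at the same locations within one pooling region, and the channel names and function names are hypothetical.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equal-length response lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0
    return cov / (sd_x * sd_y)


def pooling_region_statistics(channels):
    """Summarize one pooling region by per-channel means and pairwise correlations."""
    names = sorted(channels)
    stats = {}
    for name in names:
        stats[("mean", name)] = sum(channels[name]) / len(channels[name])
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            stats[("corr", a, b)] = correlation(channels[a], channels[b])
    return stats


# Hypothetical responses of three filters at four locations in one pooling region.
region = {
    "horizontal": [1.0, 2.0, 3.0, 4.0],
    "vertical": [2.0, 4.0, 6.0, 8.0],
    "diagonal": [4.0, 3.0, 2.0, 1.0],
}
stats = pooling_region_statistics(region)
```

The resulting dictionary records which responses tend to co-occur within the region, but not where in the region they occurred.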
Can this representation predict crowded object recognition? Balas et al (2009) demonstrate that its
inherent confusions and ambiguities predict difficulty recognizing crowded peripheral letters.
Rosenholtz, Raj, et al. (2012) further show that this model predicts crowding of other simple symbols.
Visual search employs wide field-of-view, crowded displays. Is the difference between easy and difficult
search due to local texture processing? We can utilize texture synthesis techniques to visualize the local
information available (Figure 3). When target and distractor bars differ significantly in orientation, the
statistics are sufficient to identify a crowded peripheral target. The model predicts easy “popout”
search. The model also predicts the phenomenon of illusory conjunctions, and other classic search
results (Rosenholtz, Huang, et al. 2012; Rosenholtz, Raj, et al. 2012). Characterizing visual search as
limited by peripheral processing represents a significant departure from earlier interpretations which
attributed performance to the limits of processing in the absence of covert attention (Treisman 1985).
Under the Default Processing assumption, we must also ask whether texture processing might underlie
normal object and scene recognition. We synthesized an image to have the same local summary
statistics as the original (Rosenholtz 2011; Rosenholtz, Huang, et al. 2012; see also Freeman and
Simoncelli 2011). A fixated object (Figure 3b) is clearly recognizable; it is quite well encoded by this
representation. When one glances at a scene (Figure 3d), much information is available to deduce the gist and
guide eye movements; however, precise details are lost, perhaps leading to change blindness (Oliva and
Torralba 2006; Freeman and Simoncelli 2011; Rosenholtz, Huang, et al. 2012).
These results and demos indicate the power of the Texture Tiling Model. It is image computable, and
can make testable predictions for arbitrary stimuli. It predicts difficulties of vision, such
as crowded object recognition and hard visual search, while plausibly supporting normal object and
scene recognition.
4.3 Parallels between alternative models of crowding and less successful texture models
It is instructive to consider alternative models of crowding, and their parallels to previous work on
texture perception. A number of crowding experiments have been designed to test an overly simple
texture processing model. In this “simple pooling” or “faulty-integration” model, each pooling region
yields the mean of some (often unspecified) feature. To a first approximation, this model predicts worse
performance the more one fills up the pooling region with irrelevant flankers, as doing so reduces the
informativeness of the mean. This impoverished model cannot explain improved performance with
larger flankers (Levi and Carney 2009; Manassi et al. 2012), or when flankers group with one another
(Saarela et al. 2009; Manassi et al. 2012).
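
The first-approximation prediction of the simple pooling model can be checked with a few lines of arithmetic. The sketch below assumes, for illustration only, that the pooled feature is orientation in degrees, that all flankers are vertical (0 degrees), and that averaging is linear (ignoring the circular nature of orientation). The function names are hypothetical.

```python
def pooled_mean_orientation(target_deg, flanker_degs):
    """Mean orientation over everything in the pooling region (target plus flankers)."""
    items = [target_deg] + list(flanker_degs)
    return sum(items) / len(items)


def target_signal(tilt_deg, n_flankers):
    """Change in the pooled mean when a vertical target is replaced by a tilted one."""
    flankers = [0.0] * n_flankers
    tilted = pooled_mean_orientation(tilt_deg, flankers)
    vertical = pooled_mean_orientation(0.0, flankers)
    return tilted - vertical


# The target's contribution to the mean shrinks as 1 / (n_flankers + 1).
signals = [target_signal(20.0, n) for n in (0, 1, 4)]
```

Under these assumptions the signal available from the mean falls from 20 degrees with no flankers to 4 degrees with four, regardless of flanker size or grouping, which is why this model cannot accommodate the improvements noted above.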
Partially in response to failures of the simple pooling model, researchers have suggested that some
grouping might occur prior to the mechanisms underlying crowding (Saarela et al. 2009). More
generally, the field tends to describe crowding mechanisms as operating on “things”: Levi and Carney
(2009) suggested that a key determinant of whether crowding occurs is the distance between target and
flanker centroids. Averaging might operate on discrete features of objects within the pooling region
(Parkes et al. 2001; Greenwood et al. 2009; Põder and Wagemans 2007; Greenwood et al. 2012), and/or
localization of those discrete features might be poor (Strasburger 2005; van den Berg et al. 2012). Some
crowding effects seem to depend upon target/flanker identities rather than their features (Louie et al.
2007; Dakin et al. 2010), suggesting that they may be due to later, object-level mechanisms. However,
as Dakin et al. (2010) demonstrate, these apparently “object-centered” effects can be explained by
lower-level mechanisms.
This sketch of alternative models should sound familiar. That crowding mechanisms might act after early
operations have split the input into local groups or objects should have obvious parallels to theories of
texture perception. Once again, a too-simple “stuff” model has been rejected in favor of models which
operate on “things”. These models, typically word models, do not easily make testable predictions for
novel stimuli.
4.4 The power of pooling in high dimensions
A “simple pooling model” bears little resemblance to successful texture descriptors. Texture perception
requires a high dimensional representation. The Portilla and Simoncelli (2000) texture model computes
700-1000 image statistics per texture (depending upon choice of parameters). (The Texture Tiling Model
computes this many statistics per local pooling region.) The “forced texture perception” presumed to
underlie crowding must also be high dimensional – after all, it must at the very least support perception
of actual textures.
Unfortunately it is difficult in general to get intuitions about behavior of high-dimensional models. Low-
dimensional models do not simply scale up to higher dimensions. A single mean feature value captures
little information about a stimulus. Additional statistics provide an increasingly good representation of
the original patch. Stuff-models, if sufficiently rich, can in fact capture a great deal of information about
the visual input.
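
A toy example shows how even one additional statistic changes what a pooled representation can distinguish. It assumes two hypothetical patches described only by the orientations (in degrees) of their elements, and a summary consisting of the mean followed by the variance. Real texture descriptors use hundreds of statistics rather than two.

```python
import statistics


def summary(orientations_deg, n_stats):
    """Return the first `n_stats` summary statistics: mean, then population variance."""
    stats = [
        statistics.mean(orientations_deg),
        statistics.pvariance(orientations_deg),
    ]
    return tuple(stats[:n_stats])


uniform_patch = [0.0, 0.0, 0.0, 0.0]      # all elements vertical
mixed_patch = [-10.0, 10.0, -10.0, 10.0]  # alternating tilts, same mean

# With the mean alone the two patches are indistinguishable.
same_under_mean = summary(uniform_patch, 1) == summary(mixed_patch, 1)
# Adding the variance separates them.
same_under_mean_and_variance = summary(uniform_patch, 2) == summary(mixed_patch, 2)
```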
How well a stimulus can be encoded depends upon its complexity relative to the representation. Flanker
grouping can theoretically simplify the stimulus, leading to better representation and perhaps better
performance. In some cases the information preserved is insufficient to perform a given task, and in
common parlance the stimulus is “crowded.” In other cases, the information is sufficient for the task,
predicting the “relief from crowding” accompanying, for example, a dissimilar target and flankers (e.g.
Rosenholtz, Raj, et al. 2012 and Figure 3e-h).
A high-dimensional representation can also preserve the information necessary to individuate “things”.
For instance, it can capture the approximate number of discrete objects in Figure 3e,g. In fact, one can
represent an arbitrary amount of structure in the input by varying the size of the regions over which
statistics are computed (Koenderink and van Doorn 2000), and the set of statistics. The
structural/statistical distinction is not a dichotomy, but rather a continuum.
The mechanisms underlying crowding may be “later” than texture perception mechanisms, and operate
on precomputed groups or “things”. However, just because we often recognize “things” in our stimuli,
as a result of the full visual-cognitive machinery, does not mean that our visual systems operate upon
those things to perform a given task. One should not underestimate the power of high-dimensional
models which operate on continuous “stuff”. In texture perception, such models have explained results
for a wider variety of stimuli, and with arguably simpler mechanisms.
5. Conclusions
In the last several decades, much progress has been made toward better understanding the mechanisms
underlying texture segmentation, classification, and appearance. There exists a rich body of work on
texture segmentation, both behavioral experiments and modeling. Many results can be explained by
intelligent decisions based on some fairly simple image statistics. Researchers have also developed
powerful models of texture appearance. More recent work demonstrates that similar texture-processing
mechanisms may account for the phenomena of visual crowding. The details remain to be worked out,
but if true, the visual system may employ local texture processing throughout the visual field. This
predicts that, rather than being relegated to a narrow set of tasks and stimuli, texture processing
underlies visual processing in general, supporting such diverse tasks as visual search, object and scene
recognition.
6. References
Adelson, E. H. (2001). On seeing stuff: The perception of materials by humans and machines. In B. E.
Rogowitz and T. N. Pappas (Eds.), Proceedings of the SPIE: HVEI VI, Vol. 4299. 1–12.
Andriessen, J. J., and Bouma, H. (1976) Eccentric vision: Adverse interactions between line segments.
Vision Research, 16, 71-78.
Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61(3), 183-
193.
Bajcsy, R. (1973). Computer identification of visual surfaces. Computer Graphics and Image Processing,
2(2), 118–130.
Balas, B. J. (2006). Texture synthesis and perception: using computational models to study texture
representations in the human visual system. Vision research, 46(3), 299–309.
Balas, B., Nakano, L. and Rosenholtz, R. (2009). A summary-statistic representation in peripheral vision
explains visual crowding. Journal of vision, 9(12), 1–18.
Barth, E., Zetzsche, C. and Rentschler, I. (1998). Intrinsic two-dimensional features as textons. Journal of
the Optical Society of America. A, Optics, image science, and vision, 15(7), 1723–1732.
Beck, J. (1966). Effect of orientation and of shape similarity on perceptual grouping. Perception &
psychophysics, 1(1), 300–302.
Beck, J. (1967). Perceptual grouping produced by line figures. Perception & Psychophysics, 2(11), 491–
495.
Beck, J., Sutter, A. and Ivry, R. (1987). Spatial frequency channels and perceptual grouping in texture
segregation. Computer Vision, Graphics, and Image Processing, 37(2), 299–325.
Beck, J., Prazdny, K., and Rosenfeld, A. (1983). A theory of textural segmentation. In J. Beck, B. Hope, and
A. Rosenfeld (Eds.), Human and machine vision, pp. 1-38. New York: Academic Press.
Ben-av, M. B. and Sagi, D. (1995). Perceptual grouping by similarity and proximity: Experimental results
can be predicted by intensity autocorrelations. Vision Research, 35(6), 853–866.
Bennett, P. J. and Banks, M. S. (1991). The effects of contrast, spatial scale, and orientation on foveal
and peripheral phase discrimination. Vision Research, 31(10), 1759–1786.
Bergen, J. R. and Adelson, E. H. (1988). Early vision and texture perception. Nature, 333(6171), 363–364.
Bergen, J. R. and Julesz, B. (1983). Parallel versus serial processing in rapid pattern discrimination.
Nature, 303(5919), 696–698.
Bergen, J. R. and Landy, M. S. (1991). Computational modeling of visual texture segregation. In M. S. Landy and
J. A. Movshon (Eds.), Computational models of visual processing, pp. 253-271. Cambridge, MA: MIT
Press.
Boring, E.G. (1945). Color and camouflage. In E.G. Boring (Ed.), Psychology for the armed services, pp. 63-
96. Washington, D.C: The Infantry Journal.
Bosch, A., Zisserman, A., and Munoz, X. (2006). Scene classification via pLSA. In Proc. 9th European
Conference on Computer Vision (ECCV'06), Springer Lecture Notes in Computer Science 3954: 517-
530.
Bosch, A., Zisserman, A., and Munoz, X. (2007). Image classification using random forests and ferns.
Proc. 11th International Conference on Computer Vision (ICCV'07) (Rio de Janeiro, Brazil): 1-8.
Bouma, H. (1970). Interaction effects in parafoveal letter recognition. Nature, 226, 177–178.
Bovik, A.C., Clark, M. and Geisler, W.S. (1990). Multichannel Texture Analysis Using Localized Spatial
Filters. IEEE transactions on pattern analysis and machine intelligence, 12(1), 55–73.
Braun, J. and Sagi, D. (1991). Texture-based tasks are little affected by second tasks requiring
peripheral or central attentive fixation. Perception, 20, 483–500.
Caelli, T. (1985). Three processing characteristics of visual texture segmentation. Spatial Vision, 1(1), 19–
30.
Caelli, T. M. and Julesz, B. (1978). On perceptual analyzers underlying visual texture discrimination: Part
I. Biol. Cybernetics, 28, 167-175.
Caelli, T. M., Julesz, B., and Gilbert, E. N. (1978). On perceptual analyzers underlying visual texture
discrimination: Part II. Biol. Cybernetics, 29, 201-214.
Cant, J. S. and Goodale, M. A. (2007). Attention to form or surface properties modulates different
regions of human occipitotemporal cortex. Cerebral Cortex, 17, 713-731.
Chong, S. C. and Treisman, A. (2003). Representation of statistical properties. Vision research, 43, 393-
404.
Chubb, C. and Landy, M. S. (1991). Orthogonal distribution analysis: A new approach to the study of
texture perception. In M. S. Landy and J. A. Movshon (Eds.), Computational Models of Visual
Processing, pp. 291-301. Cambridge, MA: MIT Press.
Chubb, C., Nam, J.-H., Bindman, D. R., and Sperling, G. (2007). The three dimensions of human visual
sensitivity to first-order contrast statistics. Vision research, 47(17), 2237–2248.
Dakin, S. C., Williams, C. B. and Hess, R. F. (1999). The interaction of first- and second-order cues to
orientation. Vision research, 39(17), 2867–2884.
Dakin, S. C., Cass, J., Greenwood, J. A., and Bex, P. J. (2010). Probabilistic, positional averaging predicts
object-level crowding effects with letter-like stimuli. Journal of Vision, 10(10), 1–16.
Dalal, N., and Triggs, B. (2005). Histograms of oriented gradients for human detection. In 2005 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, (CVPR ’05), 886-893.
Efros, A. A., and Leung, T. K. (1999). Texture synthesis by non-parametric sampling. In Proc. Seventh IEEE
International Conference on Computer Vision, 2, 1033-1038.
Fei-Fei, L. and Perona, P. (2005). A Bayesian Hierarchical Model for Learning Natural Scene Categories.
2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), 2,
524–531.
Fogel, I. and Sagi, D. (1989). Gabor filters as texture discriminator. Biological Cybernetics, 61, 103–113.
Freeman, J. and Simoncelli, E. P. (2011). Metamers of the ventral stream. Nature neuroscience, 14(9),
1195–1201.
Fukushima, K. (1980). Neocognitron: a self-organizing neural network model for a mechanism of pattern
recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.
Gibson, J. (1950). The perception of visual surfaces. The American journal of psychology, 63(3), 367–384.
Gibson, J. J. (1986). The ecological approach to visual perception. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Giora, E. and Casco, C. (2007). Region- and edge-based configurational effects in texture segmentation.
Vision Research, 47(7), 879-886.
Graham, N., Beck, J. and Sutter, A. (1992). Nonlinear processes in spatial-frequency channel models of
perceived texture segregation: Effects of sign and amount of contrast. Vision Research, 32(4), 719–
743.
Greenwood, J. A., Bex, P. J. and Dakin, S. C. (2009). Positional averaging explains crowding with letter-
like stimuli. Proceedings of the National Academy of Sciences of the United States of America,
106(31), 13130–13135.
Greenwood, J.A., Bex, P.J. and Dakin, S. C. (2012). Crowding follows the binding of relative position and
orientation. , 12(3), 1–20.
Gurnsey, R. and Browse, R. (1987). Micropattern properties and presentation conditions influencing
visual texture discrimination. Percept. Psychophys., 41, 239-252.
Haralick, R.M. (1979). Statistical and Structural Approaches to Texture. Proceedings of the IEEE, 67(5),
786–804.
Heeger, D. J. and Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Proceedings of the
22nd annual conference on Computer graphics and interactive techniques (SIGGRAPH ’95). IEEE
Comput. Soc. Press, 229–238.
Hess, R. F. (1982). Developmental sensory impairment: Amblyopia or tarachopia? Human neurobiology,
1, 17-29.
Hindi Attar, C., Hamburger, K., Rosenholtz, R., Götzl, H., and Spillman, L. (2007). Uniform versus random
orientation in fading and filling-in. Vision Research, 47(24), 3041–3051.
Julesz, B. (1962). Visual Pattern Discrimination. IRE Transactions on Information Theory, 8(2), 84–92.
Julesz, B. (1965). Texture and Visual Perception. Scientific American, 212, 38–48.
Julesz, B. (1975). Experiments in the visual perception of texture. Scientific American, 232(4), 34–43.
Julesz, B. (1981). A theory of preattentive texture discrimination based on first-order statistics of
textons. Biological Cybernetics, 41, 131–138.
Julesz, B., Gilbert, E. N., and Victor, J. D. (1978). Visual discrimination of textures with identical third-
order statistics. Biol. Cybernet. 31, 137-140.
Karu, K., Jain, A. and Bolle, R. (1996). Is there any texture in the image? Pattern Recognition, 29(9),
1437–1446.
Kooi, F. L., Toet, A., Tripathy, S. P., and Levi, D. M. (1994). The effect of similarity and duration on spatial
interaction in peripheral vision. Spatial vision, 8(2), 255-279.
Knutsson, H. and Granlund, G. (1983). Texture analysis using two-dimensional quadrature filters. In IEEE
Computer Society workshop on computer architecture for pattern analysis and image database
management (CAPAIDM). Silver Spring, MD: IEEE Computer Society Press, 206-213.
Koenderink, J.J., Richards, W. and van Doorn, A. J. (2012). Space-time disarray and visual awareness. i-
Perception, 3(3), 159–162.
Koenderink, J. J. and van Doorn, A. J. (2000). Blur and disorder. Journal of visual communication and
image representation, 11(2), 237-244.
Kröse, B. (1986). Local structure analyzers as determinants of preattentive pattern discrimination. Biol.
Cybernet., 55, 289-298.
Landy, M. S. and Graham, N. (2004). Visual Perception of Texture. In The Visual Neurosciences. 1106–
1118.
Lee, T.S. (1995). A Bayesian framework for understanding texture segmentation in the primary visual
cortex. Vision research, 35(18), 2643–2657.
Lettvin, J.Y. (1976). On seeing sidelong. The Sciences, 16, 10-20.
Leung, T. K. and Malik, J. (1996). Detecting, localizing, and grouping repeated scene elements from an
image. In Proc. 4th European Conf. on Computer Vision (ECCV ’96). London: Springer-Verlag, 1, 546-
555.
Levi, D. M. and Carney, T. (2009). Crowding in peripheral vision: why bigger is better. Current biology,
19(23), 1988–1993.
Levi, D. M. and Klein, S. A. (1986). Sampling in spatial vision. Nature, 320, 360-362.
Livne, T. and Sagi, D. (2007). Configuration influence on crowding. Journal of Vision, 7(2), 1–12.
Louie, E., Bressler, D. and Whitney, D. (2007). Holistic crowding: Selective interference between
configural representations of faces in crowded scenes. Journal of Vision, 7(2), 24.1–11.
Machilsen, B. and Wagemans, J. (2011). Integration of contour and surface information in shape
detection. Vision Research, 51, 179–186. doi:10.1016/j.visres.2010.11.005.
Mack, A., Tang, B., Tuma, R., Kahn, S., and Rock, I. (1992). Perceptual organization and attention.
Cognitive Psychology, 24, 475–501.
Malik, J. and Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms.
Journal of the Optical Society of America. A, 7(5), 923–932.
Manassi, M., Sayim, B. and Herzog, M. (2012). Grouping, pooling, and when bigger is better in visual
crowding. Journal of Vision, 12(10), 13.1–14.
Martelli, M., Majaj, N. and Pelli, D. (2005). Are faces processed like words? A diagnostic test for
recognition by parts. Journal of Vision, 5, 58–70.
Mutch, J. and Lowe, D. G. (2008). Object class recognition and localization using sparse features within
limited receptive fields. International Journal of Computer Vision, 80, 45-57.
Nothdurft, H. C. (1991). Texture segmentation and pop-out from orientation contrast. Vision research,
31(6), 1073–1078.
Oliva, A. and Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the
spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Oliva, A. and Torralba, A. (2006). Building the gist of a scene: the role of global image features in
recognition. Progress in brain research, 155, 23–36.
Olson, R. K. and Attneave, F. (1970). What Variables Produce Similarity Grouping? American Journal of
Psychology, 83(1), 1–21.
Parkes, L., Lund, J., Angelucci, A., Solomon, J. A., and Morgan, M. (2001). Compulsory averaging of
crowded orientation signals in human vision. Nature neuroscience, 4(7), 739–744.
Pelli, D. G., Palomares, M. and Majaj, N. (2004). Crowding is unlike ordinary masking: Distinguishing
feature integration from detection. Journal of vision, 4, 1136–1169.
Pelli, D. G. and Tillman, K. A. (2008). The uncrowded window of object recognition. Nature neuroscience,
11(10), 1129-1135.
Põder, E. and Wagemans, J. (2007). Crowding with conjunctions of simple features. Journal of vision,
7(2), 23.1–12.
Popat, K. and Picard, R. W. (1993). Novel cluster-based probability model for texture synthesis,
classification, and compression. In B. G. Haskell and H.-M. Hang (Eds.), Proc SPIE Visual
Communications and Image Processing ’93, 2094, 756-768.
Portilla, J. and Simoncelli, E. P. (2000). A Parametric Texture Model Based on Joint Statistics of Complex
Wavelet Coefficients. International Journal of Computer Vision, 40(1), 49–71.
Puzicha, J., Hofmann, T. and Buhmann, J. M. (1997). Non-parametric Similarity Measures for
Unsupervised Texture Segmentation and Image Retrieval. In Proc. Computer Vision and Pattern
Recognition, CVPR ’97, IEEE, 267–272.
Renninger, L. W. and Malik, J. (2004). When is scene identification just texture recognition? Vision
research, 44(19), 2301–2311.
Rentschler, I. and Treutwein, B. (1985). Loss of spatial phase relationships in extrafoveal vision. Nature,
313, 308–310.
Riesenhuber, M. and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature
neuroscience, 2(11), 1019–1025.
Rosenholtz, R., Huang, J., Raj, A., Balas, B. J., and Ilie, L. (2012). A summary statistic representation in
peripheral vision explains visual search. Journal of vision, 12(4), 14.1–17. doi: 10.1167/12.4.14.
Rosenholtz, R. (1999). General-purpose localization of textured image regions. In M. H. Wu et al. (Eds.),
Proc. SPIE, Human Vision and Electronic Imaging IV, 3644, 454–460. doi: 10.1117/12.348465.
Rosenholtz, R. (2000). Significantly different textures: A computational model of pre-attentive texture
segmentation. In D. Vernon (Ed.), Proc. European Conf. on Computer Vision (ECCV ’00), LNCS 1843,
197–211.
Rosenholtz, R. (2011). What your visual system sees where you are not looking. In B. E. R. and T. N.
Pappas (Eds.), SPIE: Human Vision and Electronic Imaging XVI, 7865, 786510.
doi: 10.1117/12.876659.
Rosenholtz, R., Huang, J. and Ehinger, K. A. (2012). Rethinking the role of top-down attention in vision:
Effects attributable to a lossy representation in peripheral vision. Frontiers in psychology, 3, 13.
doi:10.3389/fpsyg.2012.00013.
Rubenstein, B. S. and Sagi, D. (1996). Preattentive texture segmentation: the role of line terminations,
size, and filter wavelength. Perception & psychophysics, 58(4), 489–509.
Saarela, T. P., Sayim, B., Westheimer, G., and Herzog, M. H. (2009). Global stimulus configuration
modulates crowding. Journal of Vision, 9(2), 5.1–11.
Sayim, B., Westheimer, G., and Herzog, M. H. (2010). Gestalt Factors Modulate Basic Spatial Vision.
Psychological Science, 21(5), 641–644.
Simoncelli, E. P. and Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual
review of neuroscience, 24, 1193–1216.
Strasburger, H. (2005). Unfocused spatial attention underlies the crowding effect in indirect form vision.
Journal of vision, 5(11), 1024–1037.
Sutter, A., Beck, J. and Graham, N. (1989). Contrast and spatial variables in texture segregation: Testing a
simple spatial-frequency channels model. Perception & psychophysics, 46(4), 312–332.
Tola, E., Lepetit, V. and Fua, P. (2010). DAISY: an efficient dense descriptor applied to wide-baseline
stereo. IEEE transactions on pattern analysis and machine intelligence, 32(5), 815–830.
Tomita, F., Shirai, Y. and Tsuji, S. (1982). Description of Textures by a Structural Analysis. IEEE
transactions on pattern analysis and machine intelligence, PAMI-4(2), 183–191.
Treisman, A. (1985). Preattentive processing in vision. Computer Vision, Graphics, and Image Processing,
31, 156–177.
Turner, M.R. (1986). Texture discrimination by Gabor functions. Biological Cybernetics, 55, 71–82.
van den Berg, R., Johnson, A., Martinez Anton, A., Schepers, A. L., and Cornelissen, F. W. (2012).
Comparing crowding in human and ideal observers. Journal of vision, 12(8), 1–15.
Velardo, C. and Dugelay, J.-L. (2010). Face recognition with DAISY descriptors. In Proc. 12th ACM
workshop on multimedia and security, ACM, 95–100.
Victor, J. D. and Brodie, S. (1978). Discriminable textures with identical Buffon Needle statistics.
Biological Cybernetics, 31, 231–234.
Voorhees, H. and Poggio, T. (1988). Computing texture boundaries from images. Nature, 333, 364–367.
Wechsler, H. (1980). Texture analysis -- a survey. Signal Processing, 2, 271–282.
Wang, J.-G., Li, J., Yau, W.-Y., and Sung, E. (2010). Boosting dense SIFT descriptors and shape contexts of
face images for gender recognition. In Proc. Computer Vision and Pattern Recognition Workshop
(CVPRW ’10), San Francisco, CA, 96–102.
Zetzsche, C., Barth, E., and Wegmann, B. (1993). The importance of intrinsically two-dimensional image
features in biological vision and picture coding. In A. B. Watson (Ed.), Digital images and human
vision. Cambridge, MA: MIT Press, 109–138.
Zhu, S., Wu, Y. N., and Mumford, D. (1996). Filters, random fields and maximum entropy (FRAME) –
Towards the unified theory for texture modeling. In IEEE Conf. Computer Vision and Pattern
Recognition, 693–696.
Zhu, C., Bichot, C. E., and Chen, L. (2011). Visual object recognition using daisy descriptor. In Proc. IEEE
Intl. Conf. on Multimedia and Expo (ICME 2011), Barcelona, Spain, 1–6.
Zucker, S. W. (1976). Toward a model of texture. Computer Graphics and Image Processing, 5(2), 190–202.