Synthesis for understanding and evaluating vision systems
Eero Simoncelli
Howard Hughes Medical Institute, Center for Neural Science, and Courant Institute of Mathematical Sciences, New York University
Frontiers in Computer Vision Workshop, MIT, 21-24 Aug 2011
Computer vision
Robotics
Optics/imaging
Machine learning
Image processing
Computer graphics
Visual neuroscience
Visual perception
Why should computer vision care about biological vision?
• Optimized for general-purpose vision
• Determines/limits what is perceived
• Useful scientific testing methodologies
(Pathway: Retina → Optic Nerve → LGN → Optic Tract → Visual Cortex)
Illustrative example: building a classifier
1. Transform input to some feature space
2. Use ML to learn parameters on a large (labelled) data set
3. Test on another data set
4. Repeat
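As a concrete toy illustration of steps 1-3, the sketch below uses flattened pixels as the feature space and a nearest-class-prototype rule standing in for the ML step. All function names and parameter choices here are illustrative, not from the slides; a real system would substitute richer features and a stronger learner.

```python
import numpy as np

def extract_features(images):
    """Step 1: transform inputs to a feature space (here: flattened pixels)."""
    return images.reshape(len(images), -1)

def train_prototypes(features, labels):
    """Step 2: 'learn' one prototype (the mean feature vector) per class."""
    classes = sorted(set(labels))
    return {c: features[np.array(labels) == c].mean(axis=0) for c in classes}

def classify(features, prototypes):
    """Step 3: assign each test item to the nearest class prototype."""
    classes = list(prototypes)
    dists = np.stack([np.linalg.norm(features - prototypes[c], axis=1)
                      for c in classes])
    return [classes[i] for i in dists.argmin(axis=0)]
```

Step 4 ("repeat") then amounts to swapping in different feature transforms or learners and re-running the train/test loop.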
Which features? Oriented filters capture the stimulus-dependency of neural responses in primary visual cortex (area V1).
Simple cell
Complex cell (+)
[Adelson & Bergen, 1985]
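The Adelson-Bergen energy construction behind these two cell types can be sketched as follows: in one common formulation, a simple cell is a half-rectified linear filter response, while a complex cell sums the squared responses of a quadrature pair (even- and odd-phase filters), making its output largely insensitive to stimulus phase. The 1-D Gabor parameters below are illustrative choices, not values from the slides.

```python
import numpy as np

def gabor_pair(n=64, freq=0.125, sigma=8.0):
    """Even (cosine) and odd (sine) phase filters: a quadrature pair."""
    t = np.arange(n) - n / 2
    env = np.exp(-t**2 / (2 * sigma**2))
    even = env * np.cos(2 * np.pi * freq * t)
    odd = env * np.sin(2 * np.pi * freq * t)
    return even, odd

def simple_cell(signal, filt):
    """Half-rectified linear response: phase-sensitive."""
    return max(float(signal @ filt), 0.0)

def complex_cell(signal, even, odd):
    """Energy: sum of squared quadrature-pair responses, phase-insensitive."""
    return float((signal @ even) ** 2 + (signal @ odd) ** 2)
```

Feeding in a grating at the filters' preferred frequency, the simple-cell response swings with grating phase while the complex-cell energy stays nearly constant.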
[Carandini, Heeger, and Movshon, 1996]
The linear model of simple cells: retinal image → linear filtering → firing rate.
The normalization model of simple cells: the linear response is divided by the pooled activity of other cortical cells before producing the firing rate (RC circuit implementation).
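The normalization step can be sketched in a few lines: each unit's squared linear response is divided by the summed squared responses of the pool plus a semi-saturation constant. This is a generic sketch of divisive normalization, not the specific RC-circuit implementation; `sigma` and `gain` are illustrative parameters.

```python
import numpy as np

def normalize(linear_responses, sigma=0.1, gain=1.0):
    """Divisive normalization: each unit's squared response is divided by
    the pooled squared responses plus a semi-saturation constant sigma^2."""
    sq = np.square(linear_responses)
    return gain * sq / (sigma**2 + sq.sum())
```

Two signature properties fall out directly: responses are bounded by the gain (dynamic range control), and the ratio between any two units' responses is invariant to overall contrast, so tuning curves keep their shape as contrast changes.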
Similarly, contrast gain control depends on the root-mean-square contrast falling over a region centered over the RF (Shapley and Victor, 1979, 1981), which we term the suppressive field (Bonin et al., 2005, 2006). We posit that this measure of local contrast sets the conductance of the contrast gain control RC stage (Figure 3A).
The validity of this choice can be tested on the basis of a simple prediction: increasing the size of a grating should affect the gain and the integration time of the RF exactly in the same way as a matched increase in contrast (Shapley and Victor, 1979, 1981). Indeed, in the model both manipulations result in stronger effects of contrast gain control. We confirmed this prediction by measuring temporal weighting functions from responses to drifting gratings varying in contrast and diameter. Indeed, increasing diameter reduced both the gain and the integration time, the same effects seen when increasing contrast (Figure 3B, black). To model these effects, we allowed the conductance of the contrast gain control stage (Figure 3A) to vary with stimulus diameter as well as contrast (Figure 3C). The resulting temporal weighting functions closely resemble the ones estimated individually (Figure 3B, compare black and red) and predict the responses to gratings of various contrast and diameter almost as well (72% versus 75% stimulus-driven variance explained for the example cell; 77% versus 82% over the population, n = 34, median). The curves relating grating contrast to conductance, which depend on grating diameter (Figure 3C), could be made to lie on a single line by appropriate horizontal shifts (Figure 3D), indicating that the effects of increasing diameter could be exactly matched by an appropriate increase in contrast. The horizontal shifts determine the weight contributed by each stimulus diameter (Figures 3E and 3F), and therefore allow us to estimate the size of the suppressive field. Defining size as the diameter corresponding to half of the total volume, we find that on average the suppressive field is 2.0 ± 0.2 (s.e., bootstrap estimate, n = 34) times larger than the center of the RF (Figure 3F). These estimates are consistent with earlier measures based only on response gain (Bonin et al., 2005).
As in previous work, we postulate that local contrast is computed from the output of the light adaptation stage and is combined across a number of neurons (subunits) having spatially displaced RFs (Bonin et al., 2005; Shapley and Victor, 1979). The outputs of the subunits are squared and combined in a weighted sum, and the result is square-rooted (Bonin et al., 2006). The weights are given by the profile of the suppressive field (Figure 3A). Because the responses of the subunits are shaped by light adaptation, which has a divisive effect on the responses, at steady state this computation of local contrast reduces to the common definition of root-mean-square contrast (Shapley and Enroth-Cugell, 1984), the ratio between the standard deviation and the mean of the local luminance distribution (Experimental Procedures).
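The suppressive-field computation just described (square the subunit outputs, combine them in a weighted sum, take the square root) can be sketched as below. With light adaptation modeled as division by the local mean and a uniform suppressive-field profile (both simplifying assumptions for illustration), the result reduces exactly to RMS contrast: the standard deviation divided by the mean of the local luminance.

```python
import numpy as np

def local_contrast(patch, weights=None):
    """Suppressive-field contrast: subunit outputs (luminance deviations
    divided by the mean, mimicking the divisive effect of light adaptation)
    are squared, combined in a weighted sum, and square-rooted."""
    patch = np.asarray(patch, dtype=float)
    if weights is None:
        weights = np.full(patch.shape, 1.0 / patch.size)  # uniform profile
    mean_lum = patch.mean()
    subunits = (patch - mean_lum) / mean_lum  # light-adapted subunit outputs
    return float(np.sqrt(np.sum(weights * subunits**2)))
```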
Temporal Dynamics of Fast Adaptation
Finally, to apply the model to arbitrary scenes, we must specify how the signals driving the adaptation mechanisms are integrated over time. This matter has been extensively studied, and based on the literature we made two assumptions. First, we assumed that the measure of local luminance extends over ~100 ms in the recent past (Enroth-Cugell and Shapley, 1973a; Lankheet et al., 1993a; Lee et al., 2003; Saito and Fukada, 1986; Yeh et al., 1996). Second, we assumed that the measure of local contrast is determined entirely by the responses of the subunits, with no further temporal integration. Thus, the measure of local contrast is estimated over a brief interval (Alitto and Usrey, 2008; Baccus and Meister, 2002; Victor, 1987), whose duration is shorter when local luminance is high, and longer when local luminance is low.
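A sketch of the first assumption: local luminance as a running average of the recent past. An exponentially weighted window with a ~100 ms scale is one plausible choice; the cited papers use various window shapes, so treat the form and constants below as assumptions.

```python
import numpy as np

def local_luminance(samples, dt_ms=1.0, window_ms=100.0):
    """Running average of recent luminance samples (oldest first), with
    exponentially decaying weights on a ~100 ms scale."""
    ages = np.arange(len(samples))[::-1] * dt_ms  # age of each sample, ms
    w = np.exp(-ages / window_ms)
    w /= w.sum()
    return float(np.dot(w, samples))
```

On a constant input the estimate equals that constant; after a step change it lies between the old and new levels, weighted toward the recent value.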
Figure 3. The Spatial Footprint of Fast Adaptation
(A) Local luminance is the average luminance falling over the RF in a recent period of time. Local contrast is computed by the suppressive field, by taking the square root of the squared and integrated responses of a pool of subunits.
(B) The temporal weighting function measured with gratings of various contrast and diameter (black). Fits of the model (red) were obtained by estimating one conductance value for the second RC stage for each combination of contrast and diameter.
(C) The estimated conductance increases with both contrast (abscissa) and diameter (white to black).
(D) The four sets of conductance values can be aligned by shifting them along the horizontal axis. The resulting curve describes how conductance depends on local contrast. Red line is linear regression.
(E) The volume under the portion of the suppressive field covered by the stimuli of different diameter. The data points are obtained from the magnitude of the shifts needed to align the curves in (C). The curve is the fit of a descriptive function (Experimental Procedures). For this neuron, the size of the center of the RF is 1.0°.
(F) Average over all neurons. Stimulus diameter is normalized by the size of the center of the RF. Error bars indicate two SE.
[Mante, Bonin & Carandini, "Visual Responses to Artificial and Natural Stimuli", Neuron 58, 625–638, May 22, 2008]
Dynamic retina/LGN model
2-stage MT model
Stage 1. Input: image intensities. Output: V1 neurons tuned for spatio-temporal orientation.
Stage 2. Input: V1 afferents. Output: MT neurons tuned for local image velocity.
Each stage: Linear Receptive Field, Half-squaring Rectification, Divisive Normalization.
[Simoncelli & Heeger, 1998]
Biology uses cascades of canonical operations....
• Linear filters (local integrals and derivatives): selectivity/invariance
• Static nonlinearities (rectification, exponential, sigmoid): dynamic range control
• Pooling (sum of squares, max, etc): invariance
• Normalization: preservation of tuning curves, suppression by non-optimal stimuli
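One stage of such a cascade might be sketched as below: linear filtering, half-squaring rectification, pooling across channels (sum of squares), and divisive normalization by the pooled signal. The specific filters and constants are placeholders, and the output of one stage can be fed to the next to build a cascade.

```python
import numpy as np

def canonical_stage(signal, filters, sigma=0.1):
    """One canonical stage: linear filters -> half-squaring rectification ->
    pooling across channels -> divisive normalization by the pooled energy."""
    linear = np.array([np.convolve(signal, f, mode="same") for f in filters])
    rectified = np.maximum(linear, 0.0) ** 2           # half-squaring
    pooled = rectified.sum(axis=0, keepdims=True)      # sum-of-squares pool
    return rectified / (sigma**2 + pooled)             # divisive normalization
```

Note that the pooled energy does double duty here: it is an invariant quantity in its own right, and it serves as the normalization signal that bounds each channel's output.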
Improved object recognition?
“In many recent object recognition systems, feature extraction stages are generally composed of a filter bank, a non-linear transformation, and some sort of feature pooling layer [...] We show that using non-linearities that include rectification and local contrast normalization is the single most important ingredient for good accuracy on object recognition benchmarks. We show that two stages of feature extraction yield better accuracy than one...”
- From the abstract of “What is the Best Multi-Stage Architecture for Object Recognition?”, Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato, and Yann LeCun, ICCV 2009
Using synthesis to test models I: Gender classification
• 200 face images (100 male, 100 female)
• Labeled by 27 human subjects
• Four linear classifiers trained on subject data
[Graf & Wichmann, NIPS*03]
Linear classifiers: SVM, RVM, Prot, FLD
Each classifier (SVM, RVM, Prot, FLD) was trained both on the true data and on the subject data; the classifier weight vectors w may be visualized as images:
[Figure: face images with each classifier image subtracted or added in steps from −21 to +21, for SVM, RVM, Prot, and FLD]
[Wichmann, Graf, Simoncelli, Bülthoff, Schölkopf, NIPS*04]
Validation by “gender-morphing”
[Plot: % correct (50–100%) versus amount of classifier image added/subtracted (arbitrary units, 0.25–8.0), for SVM, RVM, Proto, and FLD]
[Wichmann, Graf, Simoncelli, Bülthoff, Schölkopf, NIPS*04]
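The morphing manipulation, adding or subtracting a scaled copy of a classifier's weight vector (itself an image) from a face, can be sketched as follows. Because the classifier is linear, its decision score moves monotonically with the morph amount; the function names and the morph scale are illustrative.

```python
import numpy as np

def gender_morph(image, w, lam):
    """Add (lam > 0) or subtract (lam < 0) a scaled classifier weight
    vector from a face image (both flattened to 1-D arrays)."""
    return image + lam * w

def classifier_score(image, w, bias=0.0):
    """Signed distance of the image from the linear decision boundary."""
    return float(image @ w + bias)
```

Since score(image + lam*w) = score(image) + lam*||w||², each unit of morph shifts the decision score by a fixed amount, which is what makes the psychometric validation curves above interpretable.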
Perceptual validation: human subject responses
of visual re-representations, from V1 to V2 to V4 to IT cortex (Figure 2). Beginning with the studies of Gross [27], a wealth of work has shown that single neurons at the highest level of the monkey ventral visual stream, the IT cortex, display spiking responses that are probably useful for object recognition. Specifically, many individual IT neurons respond selectively to particular classes of objects, such as faces or other complex shapes, yet show some tolerance to changes in object position, size, pose and illumination, and low-level shape cues. (Also see e.g. Ref. [28] for recent related results in humans.)
How can the responses of individual ventral stream neurons provide insight into object manifold untangling in the brain? To approach this, we have focused on characterizing the initial wave of neuronal population ‘images’ that are successively produced along the ventral visual stream as the retinal image is transformed and re-represented on its way to the IT cortex (Figure 2). For example, we and our collaborators recently found that simple linear classifiers can rapidly (within <300 ms of image onset) and accurately decide the category of an object from the firing rates of an IT population of ~200 neurons, despite variation in object position and size [19]. It is important to note that using ‘stronger’ (e.g. non-linear) classifiers did not substantially improve recognition performance and the same classifiers fail when applied to a simulated V1 population of equal size [19]. This shows that performance is not a result of the classifiers themselves, but the powerful form of visual representation conveyed by the IT cortex. Thus, compared with early visual representations, object manifolds are less tangled in the IT population representation.
To show this untangling graphically, Figure 3 illustrates the manifolds of the faces of Sam and Joe from Figure 1d (retina-like representation) re-represented in the V1 and IT cortical population spaces. To generate these, we took populations of simulated V1-like response functions (e.g. Refs [29,30]) and IT-like response functions (e.g. Refs [31,32]), and applied them to all the images of Joe and Sam. This reveals that the V1 representation, like the retinal representation, still contains highly curved, tangled object manifolds (Figure 3a), whereas the same object manifolds are flattened and untangled in the IT representation (Figure 3b). Thus, from the point of view of downstream decision neurons, the retinal and V1 representations are not in a good format to separate Joe from the rest of the world, whereas the IT representation is. In sum, the experimental evidence suggests that the ventral stream transformation (culminating in IT) solves object recognition by untangling object manifolds. For each visual image striking the eye, this total transformation happens progressively (i.e. stepwise
Figure 2. Neuronal populations along the ventral visual processing stream. The rhesus monkey is currently our best model of the human visual system. Like humans, monkeys have high visual acuity, rely heavily on vision (~50% of macaque neocortex is devoted to vision) and easily perform visual recognition tasks. Moreover, the monkey visual areas have been mapped and are hierarchically organized [26], and the ventral visual stream is known to be critical for complex object discrimination (colored areas, see text). We show a lateral schematic of a rhesus monkey brain (adapted from Ref. [26]). We conceptualize each stage of the ventral stream as a new population representation. The lower panels schematically illustrate these populations in early visual areas and at successively higher stages along the ventral visual stream; their relative size loosely reflects their relative output dimensionality (approximate number of feed-forward projection neurons). A given pattern of photons from the world (here, a face) is transduced into neuronal activity at the retina and is progressively and rapidly transformed and re-represented in each population, perhaps by a common transformation (T). Solid arrows indicate the direction of visual information flow based on neuronal latency (~100 ms latency in IT), but this does not preclude fast feedback both within and between areas (dashed arrows, see Box 1). The gray arrows across the bottom indicate the population representations for the retina, V1 and IT, which are considered in Figures 1d and 3a,b, respectively. RGC, retinal ganglion cells; LGN, lateral geniculate nucleus.
[Opinion, TRENDS in Cognitive Sciences, Vol. 11, No. 8, p. 337]
Using synthesis to test models II: Ventral stream representation
[DiCarlo & Cox, 2007]
[Figure 1, panels a and b: receptive field size (deg) versus receptive field center (deg), eccentricities 0–50 deg, for areas V1, V2, and V4]
Figure 1. Physiological measurements of receptive field size in macaque. (a) Receptive field size (diameter) as a function of receptive field center (eccentricity) for visual areas V1, V2, and V4. Data adapted from Gattass et al. (1981) and Gattass et al. (1988). The size-to-eccentricity relationship in each area is well described by a “hinged” line (see Methods). (b) Cartoon depiction of receptive fields with sizes based on physiological measurements. The center of each array is the fovea. The size of each circle is proportional to its eccentricity, based on the corresponding scaling parameter (slope of the fitted line in a). At a given eccentricity, a larger scaling parameter implies larger receptive fields. In our model, we use overlapping pooling regions that uniformly tile the image and are separable and of constant size in polar angle and log eccentricity (Supplementary Fig. 1).
[Gattass et al., 1981; Gattass et al., 1988]
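A "hinged" line of the kind used to fit the size-to-eccentricity data can be sketched as below. The exact parameterization used in the papers may differ; this form (constant size out to a hinge eccentricity, then linear growth with the scaling-parameter slope) and its parameter values are assumptions for illustration.

```python
def hinged_line(ecc, base=0.5, slope=0.3, hinge=1.0):
    """Receptive field size (deg) as a 'hinged' function of eccentricity
    (deg): flat at `base` up to `hinge`, then growing linearly with `slope`.
    Parameter values are illustrative, not fitted values from the data."""
    return base + slope * max(ecc - hinge, 0.0)
```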
[Diagram: receptive field sizes as a function of eccentricity, receptive field center (deg), across areas V1, V2, V4, and IT]
[Freeman & Simoncelli, Nature Neurosci, Sep 2011]
Ventral stream receptive fields: canonical computation
A ventral-stream “complex” cell pools (+) the outputs of V1 cells.
[Freeman & Simoncelli, Nature Neurosci, Sep 2011]
How do we test this?
[Freeman & Simoncelli, Nature Neurosci, Sep 2011]
Original image → Model responses → Synthesized image
Idea: synthesize random samples from the equivalence class of images with identical model responses
Scientific prediction: such images should look the same (“Metamers”)
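A toy sketch of the synthesis idea: start from noise and adjust the image by gradient descent until a deliberately simple stand-in "model" (here, just local block averages) matches its responses to the original image. The real model in the paper uses rich texture statistics computed in overlapping, eccentricity-scaled pooling regions; this sketch only illustrates synthesis-by-response-matching, and all parameter choices are placeholders.

```python
import numpy as np

def model(img, block=4):
    """Stand-in model: mean luminance in non-overlapping blocks."""
    h, w = img.shape
    return img.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

def synthesize(target_resp, shape, block=4, steps=200, lr=1.0, seed=0):
    """Gradient descent on the squared response error, from a noise seed."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    for _ in range(steps):
        err = model(x, block) - target_resp  # response mismatch per block
        # gradient of 0.5*||err||^2 w.r.t. the image: spread each block's
        # error back over its pixels, divided by the block area
        grad = np.kron(err, np.ones((block, block))) / block**2
        x -= lr * grad
    return x
```

The synthesized image ends up with (nearly) identical model responses to the original while remaining a different image; the unconstrained within-block structure stays random, which is exactly the "random sample from the equivalence class" idea.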
original image
synthesized image: should look the same when you fixate on the red dot
Reading
[Freeman & Simoncelli, Nature Neurosci, Sep 2011]
Figure 7. Effects of crowding on reading and searching. (a) Two metamers, matched to the model responses of a page of text from the first paragraph of Herman Melville’s “Moby Dick”. Each metamer was synthesized using a different foveal location (the letter above each red dot). These locations are separated by the distance readers typically traverse between fixations [49]. In each metamer, the central word is largely preserved; farther in the periphery the text is letter-like but scrambled, as if printed with non-Latin characters. Note that the boundary of readability in the first image roughly coincides with the location of the fixation in the second image. We emphasize that these are samples drawn from the set of images that are perceptually metameric; although they illustrate the kinds of distortions that result from the model, no single example represents “what an observer sees” in the periphery. (b) The notoriously hard-to-find “Waldo” (the character with the red and white striped shirt) blends into the distracting background, and is only recognizable when we (or the model) look right at him. Cross-hairs surrounding each image indicate the location of the model fovea. (c) A soldier in Afghanistan wears sandy-stone patterned clothing to match the stony texture of the street, and similarly blends into the background.
Camouflage
[Freeman & Simoncelli, Nature Neurosci, Sep 2011]
Cascades of linear filtering, squaring/products, averaging over local regions....
Can this really lead to object recognition?
“Perhaps texture, somewhat redefined, is the primitive stuff out of which form is constructed”
- Jerome Lettvin, 1976