Learning Spatial Context:
Using Stuff to Find Things
Geremy Heitz Daphne Koller
Department of Computer Science
Stanford University
Stanford, CA 94305
Abstract. The sliding window approach of detecting rigid objects (such
as cars) is predicated on the belief that the object can be identified
from the appearance in a defined region around the object. Other types
of objects of amorphous spatial extent (e.g., trees, sky), however, are
more naturally classified based on texture or color. In this paper, we
seek to combine recognition of these two types of objects into a system
that leverages “context” toward improving detection. In particular, we
cluster regions of the image based on their ability to serve as context
for the detection of objects. Rather than providing an explicit training
set with region labels, our method automatically groups regions based
on both their appearance and their relationships to the detections in the
image. We show that our things and stuff (TAS) context model produces
meaningful clusters that are readily interpretable, and helps improve our
detection ability over state-of-the-art detectors. We present results on
object detection in images from the PASCAL VOC 2005/2006 datasets
and on the task of overhead car detection in satellite images.
1 Introduction
Recognizing objects in an image requires combining many different signals from
the raw image data. Figure 1 shows an example satellite image of a street scene,
where we may want to identify all of the cars. From a human perspective, there
are two primary signals that we leverage. The first is the local appearance in the
window near the potential car. The second is our knowledge that cars appear
on roads. This second signal is a form of contextual knowledge, and our goal in
this paper is to capture this idea in a rigorous probabilistic model.
Recent papers have demonstrated that boosted object detectors can be effective
at detecting monolithic object classes, such as cars [1] or faces [2]. Similarly,
several works on multiclass segmentation have shown that regions in images
can effectively be classified based on their color or texture properties [3]. These
two lines of work have made significant progress on the problems of identifying
“things” and “stuff,” respectively. The important differentiation between these
two classes of visual objects is summarized in Forsyth et al. [4] as:
The distinction between materials (“stuff”) and objects (“things”)
is particularly important. A material is defined by a homogeneous or
repetitive pattern of fine-scale properties, but has no specific or
distinctive spatial extent or shape. An object has a specific size and shape.
Fig. 1. (Left) An aerial photograph. (Center) Detected cars in the image (solid green = true detections, dashed red = false detections). (Right) Finding “stuff” such as buildings, by classifying regions, shown delineated by red boundaries.
Recent work has also shown that classifiers targeting particular classes of
things or stuff can benefit from the proper use of contextual cues. The use of
context can be split into a number of categories. Scene-Thing context allows
scene-level information, such as the scale or the “gist” [5], to determine prior
location probabilities for the various objects. Stuff-Stuff context captures the
notion that sky occurs above sea and road below building [6]. Thing-Thing context
considers the co-occurrence of objects, such as the notion that a tennis racket is
more likely to occur with a tennis ball than with a lemon [7]. Finally, Stuff-Thing
context allows the texture regions (e.g., the roads and buildings in Figure 1) to
add predictive power to the detection of objects (e.g., the cars in Figure 1). We
focus on this fourth type of context. Figure 2 shows an example of this context
in the case of satellite imagery.
In this paper, we present a probabilistic model linking the detection of things
to the unsupervised classification of stuff. Our method can be viewed as an
attempt to cluster “stuff,” represented by coherent image regions, into clusters
that are both visually similar and best able to provide context for the detectable
“things” in the image. Cluster labels for the image regions are probabilistically
linked to the detection window labels through the use of region-detection
“relationships,” which encode their relative spatial locations. We call this model the
things and stuff (TAS) context model because it links these two components into
a coherent whole. The graphical representation of the TAS model is depicted in
Figure 3. At training time, this model leverages supervised (groundtruth)
detection labels to learn parameters using the Expectation-Maximization (EM)
algorithm. For test instances, both the region labels and detection labels are
inferred from the model.
We present results of the TAS method on diverse datasets. Using the Pascal
Visual Object Classes challenge datasets from 2005 and 2006 (VOC2005 and
VOC2006), we utilize one of the top competitors as our baseline detector [10],
and demonstrate that our TAS method improves the performance of detecting
cars, bicycles and motorbikes in street scenes and cows and sheep in rural scenes
(see Figure 4). In addition, we consider a very different dataset of satellite images
from Google Earth, of the city and suburbs of Brussels, Belgium. The task here
is to identify the cars from above; see Figure 2. For clarity, the satellite data is
used as the running example throughout the paper, but all descriptions apply
equally to the other datasets.
Fig. 2. Example detections from the satellite dataset that demonstrate context. Classifying using local appearance only, we might think that both windows at left are cars. However, when seen in context, the bottom detection is unlikely to be an actual car.
2 Related Work
The role of context in object recognition has become an important topic, due
both to the psychological basis of context in the human visual system [11] and to
the striking algorithmic improvements that “visual context” has provided [12].
The word “context” has been attached to many different ideas. One of the
simplest forms is co-occurrence context. The work of Rabinovich et al. [7]
demonstrates the use of this context, where the presence of a certain object class in an
image probabilistically influences the presence of a second class. The context of
Torralba et al. [12] assumes that certain objects occur more frequently in
certain rooms, as monitors tend to occur in offices. While these methods achieve
excellent results when many different object classes are labeled per image, they
are unable to leverage unsupervised data for contextual object recognition.
In addition to co-occurrence context, many approaches take into account
the spatial relationships between objects. At the local descriptor level, Wolf et
al. [13] detect objects using a descriptor with a large capture range, allowing the
detection of the object to be influenced by surrounding image features. Because
these methods use only the raw features for context, however, they cannot obtain
a holistic view of an entire scene. Similarly, Fink and Perona [14] use the output
of boosted detectors for other classes as additional features for the detection of
a given class. This allows the inclusion of signal beyond the raw features, but
requires that all “parts” of a scene be supervised. In contrast, Murphy et al. [5]
use a global feature known as the “gist” to learn statistical priors on the locations
of various objects within the context of the specific scene. The gist descriptor
is excellent at predicting the scene type and large structures in the scene, but
cannot handle the local interactions present in the satellite data, for example.
Another approach to modeling spatial relationships is to use a Markov Random
Field (MRF) or variant (CRF, DRF) [8, 9] to encode the preferences for certain
spatial relationships. These techniques offer a great deal of flexibility in the
formulation of the affinity function and all the standard benefits of a graphical
model formulation (e.g., well-known learning and inference techniques). Singhal
et al. [6] also use similar concepts to the MRF formulation to aggregate
decisions across the image. These methods, however, suffer from two drawbacks.
First, they tend to require a large amount of annotation in the training set.
Second, they put things and stuff on the same footing, representing both as “sites”
in the MRF. Our method requires less annotation and allows detections and
image regions to be represented in their (different) natural spaces.
Perhaps the most ambitious attempts to use context involve modeling the
scene of an image holistically. Torralba [1], for instance, uses global
image features to “prime” the detector with the likely presence/absence of
objects, the likely locations, and the likely scales. The work of Hoiem and Efros [15]
takes this one step further by explicitly modeling the 3D layout of the scene.
This allows the natural use of scale and location constraints (e.g., things closer
to the camera are larger). Their approach, however, is tailored to street scenes,
and requires domain-specific modeling. The specific form of their priors would
be useless in the case of satellite images, for example.
3 Things and Stuff (TAS) Context Model
Our probabilistic context model builds on two standard components that are
commonly used in the literature. The first is sliding window object detection,
and the second is unsupervised image region clustering. A common method for
finding “things” is to slide a window over the image, score each window’s match
to the model, and return the highest matching such windows. We denote the
features in the ith candidate window by Wi, the presence/absence of the target
class in that window by Ti (T for “thing”), and assume that what we learn in
our detector is a conditional probability P(Ti | Wi) from the window features
to the probability that the window contains the object; this probability can be
derived from most standard classifiers, such as the highly successful boosting
approaches [1]. The set of windows included in the model can, in principle,
include all windows in the image. We, however, limit ourselves to looking at the
top scoring detection windows according to the detector (i.e., all windows above
some low threshold, chosen in order to capture most of the true positives).
The second component we build on involves clustering coherent regions of the
image into groups based on appearance. We begin by segmenting the image into
regions, known as superpixels, using the normalized cut algorithm of Ren and
Malik [16]. For each region j, we extract a feature vector Fj that includes color
and texture features. For our stuff model, we will use a generative model where
each region has a hidden class, and the features are generated from a Gaussian
distribution with parameters depending on the class. Denote by Sj (S for “stuff”)
the (hidden) class label of the jth region. We assume that the features are derived
from a standard naive Bayes model, where we have a probability distribution
P(Fj | Sj) of the image features given the class label.
In order to relate the detector for “things” (T’s) to the clustering model over
“stuff” (S’s), we need to develop an intuitive representation for the relationships
between these units. Human knowledge in this area comes in sentences like “cars
drive on roads,” or “cars park 20 feet from buildings.” We introduce variables
Rij that represent the relationship between candidate detection i and region
j. The values of Rij indicate different relationships, such as: “detection i is in
region j”, or “detection i is about 100 pixels away from region j”.
We can now link our component models into a single coherent probabilistic
things and stuff (TAS) model, as depicted in the plate model of Figure 3(a).
[Figure 3: (a) TAS plate model, with nodes Wi (window), Ti (object presence), Sj (region label), Fj (region features), and Rij (relationship); the candidate window plate is labeled N and the image region plate is labeled J. (b) TAS ground network over candidate windows T1 to T3 and image regions S1 to S5, with example relationships R11 = “Left”, R13 = “In”, R21 = “Above”, R31 = “Left”, R33 = “In”.]
Fig. 3. The TAS model. The plate representation (a) gives a compact visualization of the model, which unrolls into a “ground” network (b) for any particular image.
Probabilistic influence flows between the detection window labels and the image
region labels through the v-structures that are activated by observing the
relationship variables. For a particular input image, this plate model unrolls into a
“ground” network that encodes a distribution over the detections and regions of
the image. Figure 3(b) shows a toy example of a ground network for the image of
Figure 1. It is interesting to note the similarities between TAS and the MRF
approaches in the literature. In effect, the relationship variables link the detections
and the regions into a probabilistic web where signals at one point in the web
(say the strong appearance of a particular region) influence a detection in
another part of the image, which in turn influences the label of a different region in
yet another location. In the example of Figure 3(b), if region 3 has strong road
appearance, it might influence T1 through the relationship R13. This in turn
will influence the labeling of region 1 through the relationship R11. Because our
method uses a generative model for the region labels, training can be performed
very effectively, even when the Sj variables are unobserved (see below).
All variables in the TAS model are discrete except for the feature variables Fj.
This allows for simple table conditional probability distributions (CPDs)
for all discrete nodes in this Bayesian network. The probability distribution over
these variables decomposes according to:
$$P(\mathbf{T}, \mathbf{S}, \mathbf{F}, \mathbf{R}, \mathbf{W}) = \prod_{i} P(W_i)\, P(T_i \mid W_i) \prod_{j} P(S_j)\, P(F_j \mid S_j) \prod_{ij} P(R_{ij} \mid T_i, S_j).$$
One of the main benefits of the TAS approach is its simplicity and modularity.
It allows us to “plug in” any sliding window detector and any generative approach
for region clustering (e.g., [3]).
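As an illustration of the factorization above, the following is a minimal sketch of how the log of the joint could be evaluated for one image, dropping the P(Wi) terms, which are constant once the windows are observed. The function name, the argument layout, and the assumption that the region likelihoods log P(Fj | Sj) are precomputed are all hypothetical choices made for this example; they are not taken from the paper.

```python
import math

def tas_log_joint(T, S, R, p_t_given_w, log_p_s, log_p_f_given_s, p_r_given_ts):
    """Log of the TAS joint for one image, omitting the constant P(W_i) terms.

    T[i]                 : 0/1 label of candidate window i
    S[j]                 : cluster index of region j
    R[i][j]              : relationship index between window i and region j
    p_t_given_w[i]       : detector probability P(T_i = 1 | W_i)
    log_p_s[k]           : log prior of cluster k
    log_p_f_given_s[j][k]: log P(F_j | S_j = k), assumed precomputed
    p_r_given_ts[t][k][r]: table CPD P(R = r | T = t, S = k)
    """
    total = 0.0
    # Detector term: one factor per candidate window.
    for i, t in enumerate(T):
        p = p_t_given_w[i]
        total += math.log(p if t == 1 else 1.0 - p)
    # Region term: cluster prior times feature likelihood.
    for j, k in enumerate(S):
        total += log_p_s[k] + log_p_f_given_s[j][k]
    # Relationship term: one factor per (window, region) pair.
    for i, t in enumerate(T):
        for j, k in enumerate(S):
            total += math.log(p_r_given_ts[t][k][R[i][j]])
    return total
```

Because each factor touches only one window, one region, or one window-region pair, the detector and the region model can be swapped independently, which is the modularity noted above.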
4 Learning and Inference in the TAS Model
Because TAS unrolls into a Bayesian network for each image, we can use standard
learning and inference methods. In particular, we learn the parameters of our
model using the Expectation-Maximization (EM) [17] algorithm and perform
inference using Gibbs sampling, a standard variant of MCMC sampling [18].
Learning the Model with EM. At learning time, we have a set of images
with annotated labels for our target object class(es). We first train the base
detectors using this set, and select as our candidate windows all detections above
a threshold; we thus obtain a set of candidate detection windows W1 . . . WN
along with their groundtruth labels T1 . . . TN. We also have a set of regions and
a feature vector Fj for each, and an Rij relationship variable for every
window-region pair. Our goal is to learn parameters for the TAS model.
One option is to learn in two phases: we first use an existing clustering method
(such as K-means or Mixture-of-Gaussian clustering) to assign the Sj variables
in the training data; we then learn parameters for the TAS model with fully
observed data. This approach is attractive as it allows the first part of the learning
to be fully unsupervised, enabling us to use more images than exist in our labeled
dataset. We call the model learned with this method Pre-Clustered TAS.
While the resulting clusters are likely to be visually coherent, however, they do
not exploit the spatial relationships between the regions and the object detections.
We therefore propose a joint training method where we perform EM in the
full model; this model is called the Full TAS model. The EM algorithm iterates
between using probabilistic inference to derive a soft completion of the hidden
variables (E-step) and finding maximum-likelihood parameters relative to this
soft completion (M-step). The E-step here is particularly easy: at training time,
only the Sj’s are unobserved; moreover, because the T variables are observed,
the Sj’s are conditionally independent of each other. Thus, the inference step
turns into a simple computation for each Sj separately, a process which can be
performed in linear time. The M-step for table CPDs can be performed easily in
closed form. To provide a good starting point for EM, we initialize the cluster
assignments using the K-means algorithm. EM is guaranteed to converge to a
local maximum of the likelihood function of the observed data.
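The per-region E-step and the closed-form M-step for the relationship CPD can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the data layout, and the small pseudo-count used to avoid empty table entries are assumptions introduced here.

```python
import math

def e_step_posterior(j, T, R, log_p_s, log_p_f_given_s, p_r_given_ts):
    """Posterior P(S_j = k | T, F, R) for one region, with all T_i observed."""
    scores = []
    for k in range(len(log_p_s)):
        s = log_p_s[k] + log_p_f_given_s[j][k]
        for i, t in enumerate(T):
            s += math.log(p_r_given_ts[t][k][R[i][j]])
        scores.append(s)
    # Normalize in log space (log-sum-exp) for numerical stability.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def m_step_relationship_cpd(T, R, posteriors, K, num_rel, pseudo=1e-3):
    """Re-estimate the table P(R | T, S) from soft counts.

    posteriors[j][k] is the E-step output for region j; `pseudo` is a
    hypothetical smoothing constant, not a value from the paper.
    """
    counts = [[[pseudo] * num_rel for _ in range(K)] for _ in range(2)]
    for i, t in enumerate(T):
        for j, post in enumerate(posteriors):
            for k in range(K):
                counts[t][k][R[i][j]] += post[k]
    # Normalize each (t, k) row so it sums to one over relationship values.
    return [[[c / sum(row) for c in row] for row in per_t] for per_t in counts]
```

Each call to `e_step_posterior` touches only the windows related to one region, which reflects the conditional independence of the Sj’s given the observed T’s.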
Inference with Gibbs Sampling. At test time, our system must determine
which windows in a new image contain the target object. We observe the
candidate detection windows (Wi’s, extracted by thresholding the base detector
output), the features of each image region (Fj’s), and the relationships (Rij’s).
Our task is to find the probability that each window contains the object:
$$P(\mathbf{T} \mid \mathbf{F}, \mathbf{R}, \mathbf{W}) = \sum_{\mathbf{S}} P(\mathbf{T}, \mathbf{S} \mid \mathbf{F}, \mathbf{R}, \mathbf{W}) \tag{1}$$
Unfortunately, this expression involves a summation over an exponential set of
values for the S vector of variables. We solve the inference problem
approximately using a Gibbs sampling [18] MCMC method. We begin with some
assignment to the variables. Then, in each Gibbs iteration we first resample all of
the S’s and then resample all the T’s according to the following two probabilities:
$$P(S_j \mid \mathbf{T}, \mathbf{F}, \mathbf{R}, \mathbf{W}) \propto P(S_j)\, P(F_j \mid S_j) \prod_{i} P(R_{ij} \mid T_i, S_j) \tag{2}$$
$$P(T_i \mid \mathbf{S}, \mathbf{F}, \mathbf{R}, \mathbf{W}) \propto P(T_i \mid W_i) \prod_{j} P(R_{ij} \mid T_i, S_j). \tag{3}$$
These sampling steps can be performed efficiently, as the Ti variables are
conditionally independent given the S’s and the Sj’s are conditionally independent
given the T’s. In the last Gibbs iteration for each sample, rather than resampling
T, we compute the posterior probability over T given our current S samples,
and use these distributional particles for our estimate of the probability in (1).
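The sampling scheme just described can be sketched in a few lines. The code below is a simplified illustration under several assumptions: all names are hypothetical, the region likelihoods P(Fj | Sj) are passed in precomputed, and probabilities are multiplied directly rather than in log space, which would underflow for images with many regions.

```python
import random

def _sample(weights, rng):
    """Draw an index with probability proportional to non-negative weights."""
    u = rng.random() * sum(weights)
    acc = 0.0
    for idx, w in enumerate(weights):
        acc += w
        if u < acc:
            return idx
    return len(weights) - 1

def posterior_t(i, S, R, p_t_given_w, p_r_given_ts):
    """P(T_i = 1 | S, R, W): the right-hand side of (3), normalized."""
    w1 = p_t_given_w[i]
    w0 = 1.0 - p_t_given_w[i]
    for j, k in enumerate(S):
        w1 *= p_r_given_ts[1][k][R[i][j]]
        w0 *= p_r_given_ts[0][k][R[i][j]]
    return w1 / (w0 + w1)

def gibbs_estimate(R, p_t_given_w, p_s, p_f_given_s, p_r_given_ts,
                   num_samples=10, num_iters=5, seed=0):
    """Estimate P(T_i = 1 | F, R, W) by averaging distributional particles."""
    rng = random.Random(seed)
    N, J, K = len(p_t_given_w), len(p_f_given_s), len(p_s)
    estimate = [0.0] * N
    for _ in range(num_samples):
        # Random initialization of every chain.
        T = [int(rng.random() < p) for p in p_t_given_w]
        S = [rng.randrange(K) for _ in range(J)]
        for it in range(num_iters):
            for j in range(J):  # resample each S_j as in (2)
                w = []
                for k in range(K):
                    v = p_s[k] * p_f_given_s[j][k]
                    for i in range(N):
                        v *= p_r_given_ts[T[i]][k][R[i][j]]
                    w.append(v)
                S[j] = _sample(w, rng)
            probs = [posterior_t(i, S, R, p_t_given_w, p_r_given_ts)
                     for i in range(N)]
            if it < num_iters - 1:  # the last iteration keeps the posterior
                T = [int(rng.random() < p) for p in probs]
        for i in range(N):
            estimate[i] += probs[i] / num_samples
    return estimate
```

Averaging the final-iteration posteriors instead of hard samples of T gives a smoother estimate from the same number of chains.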
5 Experimental Results
In order to evaluate the ability of the TAS model to learn and utilize context,
we perform experiments on three datasets that differ along several axes. The
first two datasets are from the PASCAL Visual Object Classes challenges 2005
and 2006 [19]. The scenes are urban and rural, indoor and outdoor, and there is
a great deal of scale and shape variation amongst the objects. The third is a set
of satellite images acquired from Google Earth. The goal in these images is to
detect cars from above. Because of the impoverished visual information, there
are many false positives when a sliding window detector is applied. In this case,
context provides a filtering mechanism to remove the false positives. Because
these two applications are different, and in order to demonstrate the flexibility
of our approach, we use different detectors for each. In all experiments, we allow
S a cardinality of K = 10 (see footnote 1), and use 42 features for the image
regions that represent color, texture, and shape [20]. For clusters, we keep a mean
and full covariance matrix over these features. A small regularization (10^{-6} I)
is added to the covariance to ensure positive definiteness.
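To make the role of the covariance regularization concrete, here is a minimal pure-Python sketch of a full-covariance Gaussian log-density that adds a small multiple of the identity before factorizing. The helper names and the Cholesky-based evaluation are assumptions made for illustration; the paper does not describe how the density is computed.

```python
import math

def regularize(cov, eps=1e-6):
    """Return cov + eps * I as a new matrix (list of rows)."""
    n = len(cov)
    return [[cov[r][c] + (eps if r == c else 0.0) for c in range(n)]
            for r in range(n)]

def cholesky(a):
    """Lower-triangular L with L L^T = a; fails if a is not positive definite."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = a[r][c] - sum(L[r][k] * L[c][k] for k in range(c))
            if r == c:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[r][c] = math.sqrt(s)
            else:
                L[r][c] = s / L[c][c]
    return L

def gaussian_log_pdf(x, mean, cov, eps=1e-6):
    """Log density of N(mean, cov + eps * I) evaluated at x."""
    n = len(x)
    L = cholesky(regularize(cov, eps))
    d = [xi - mi for xi, mi in zip(x, mean)]
    # Forward substitution solves L y = d, so y . y is the Mahalanobis term.
    y = [0.0] * n
    for r in range(n):
        y[r] = (d[r] - sum(L[r][k] * y[k] for k in range(r))) / L[r][r]
    log_det = 2.0 * sum(math.log(L[r][r]) for r in range(n))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det
                   + sum(v * v for v in y))
```

A rank-deficient sample covariance, which can arise when a cluster holds few regions, has no Cholesky factor on its own; the added diagonal term is what makes the factorization go through.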
PASCAL VOC Datasets. For these experiments, we used four classes from
the VOC2005 data, and two classes from the VOC2006 data. The VOC2005
dataset consists of 2232 images, manually annotated with bounding boxes for
four image classes: cars, people, motorbikes, and bicycles. We use the “train+val”
set (684 images) for training, and the “test2” set (859 images) for testing. The
VOC2006 dataset contains 5304 images, manually annotated with 12 classes, of
which we use the cow and sheep classes. We train on the “trainval” set (2618
images) and test on the “test” set (2686 images). To compare with the results
of the challenges, we adopted as our detector the HOG (histogram of oriented
gradients) detector of Dalal and Triggs [10]. This detector uses an SVM and
therefore outputs a score margin_i ∈ (−∞, +∞), which we convert into a
probability by learning a logistic regression function for P(Ti | margin_i). We also plot
the precision-recall curve using the code provided in the challenge toolkit.
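The conversion from an SVM margin to a probability can be sketched as a one-dimensional logistic regression. The paper does not state how the regression is fit, so the plain batch gradient descent below, along with its learning rate, step count, and function names, is an assumption made purely for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_margin_calibration(margins, labels, lr=0.1, steps=2000):
    """Fit P(T = 1 | margin) = sigmoid(a * margin + b) by gradient descent.

    margins: detector scores for candidate windows
    labels : matching 0/1 groundtruth labels
    """
    a, b = 1.0, 0.0
    n = len(margins)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for m, y in zip(margins, labels):
            err = sigmoid(a * m + b) - y  # gradient of the log loss
            grad_a += err * m
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b
```

For example, `a, b = fit_margin_calibration(margins, labels)` followed by `sigmoid(a * m + b)` yields a value that can play the role of P(Ti | margin_i) for a new window with margin `m`.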
1 Results were robust to a range of K between 5 and 20.
[Figure 4: panels (a) to (f).]
Fig. 4. (a,b) Example training detections from the bicycle class. The related image regions are colored by their most likely cluster label. In both examples, the blue region that is horizontally offset from the detection belongs to cluster #3. (c) 16 of the top scoring regions for cluster #3, ranked by P(F | S = 3) (likelihood of image features). This cluster corresponds to “roads” or “bushes” as things that are gray/green and occur next to cars. (d) A case where context helped find a true detection. (e,f) Two examples where incorrect detections are filtered out by context.
Because these images are taken parallel to the ground plane, we use
relationships that capture axis-aligned interactions at distances relative to the size of the
candidate detected object. We therefore use the following relationships: Rij = 1
(IN) for the region j closest to the center of the detection window; Rij = 2, 3, 4, 5
for the regions that are one bounding box width to the LEFT of, to the RIGHT
of, ABOVE, and BELOW the window; Rij = 6 (NONE) for all other regions.
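One way to read the relationship definitions above is as a set of probe points around the window, each claiming its nearest region. The sketch below follows that reading; representing each region by its centroid, choosing the nearest centroid to each probe point, and using image coordinates in which y grows downward are all assumptions of this example rather than details given in the text.

```python
IN, LEFT, RIGHT, ABOVE, BELOW, NONE = 1, 2, 3, 4, 5, 6

def nearest_region(point, centroids):
    """Index of the region whose centroid is closest to `point`."""
    px, py = point
    return min(range(len(centroids)),
               key=lambda j: (centroids[j][0] - px) ** 2
                           + (centroids[j][1] - py) ** 2)

def axis_aligned_relationships(window, centroids):
    """Relationship label R_ij for every region j, given one window i.

    window   : (center_x, center_y, width) of the detection window
    centroids: list of (x, y) region centroids; y grows downward
    """
    cx, cy, w = window
    labels = [NONE] * len(centroids)
    # Probe one bounding-box width away in each direction. IN is applied
    # last so that it wins if one region is nearest to several probes.
    probes = [(BELOW, (cx, cy + w)), (ABOVE, (cx, cy - w)),
              (RIGHT, (cx + w, cy)), (LEFT, (cx - w, cy)),
              (IN, (cx, cy))]
    for label, point in probes:
        labels[nearest_region(point, centroids)] = label
    return labels
```

Under this reading, at most five regions per window receive an informative label and all others are NONE.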
Figure 4 (top row) shows example bicycle detection candidates, and the
relationships they have to the regions in their images. These examples suggest
the type of context that might be learned. For example, the region beside both
detections (colored blue) belongs to cluster #3, which looks visually like a road
or bush cluster. The learned values of the model parameters also indicate that
being to the left or right of this cluster increases the probability of a window
containing a bicycle (e.g., by about 33% in the case where Rij is RIGHT).
We performed a single run of EM learning to convergence, which takes around
2 hours on an Intel Dual Core 1.9 GHz machine with 2 GB of memory. We run
separate experiments for each class, though in principle it would be possible to
learn a single joint model over all classes. By separating the classes, we are able
to isolate the contextual contribution from the stuff, rather than between the
different types of things present in the images. For our MCMC inference, we found
that, due to the strength of the baseline detectors, the Markov chain converged
fairly rapidly; we achieved very good results using merely 10 MCMC samples,
where each is initialized randomly and then undergoes 5 Gibbs iterations. The
entire inference process takes about one second per image.
[Figure 5: precision (y-axis) vs. recall rate (x-axis) for panels (a) Cars (2005), (b) Motorbikes (2005), (c) People (2005), (d) Bicycles (2005), (e) Cows (2006), (f) Sheep (2006). Legend entries: TAS Model, Base Detectors, and INRIA-Dalal or INRIA-Douze.]
Fig. 5. Precision-recall curves for the VOC2005 and VOC2006 classes.
The bottom row of Figure 4 shows some detections that were corrected using
context. We show one example where a true bicycle was discovered using context,
and two examples where false positives were filtered out by our model. These
examples demonstrate the type of information that is being leveraged by TAS.
In the first example, the sky above, and the road beside the window give a signal
that this detection is at ground level, and is therefore likely to be a bicycle.
Figure 5 shows the full precision-recall curve for each class. For (a-d) we
compare to the 2005 INRIA-Dalal challenge entry, and for (e,f) we compare to
the 2006 INRIA-Douze entry, both of which used the HOG detector. We also
show the curve produced by our Base Detector alone (see footnote 2). Finally,
we plot the curves produced by our TAS Model, trained using full EM, which
scores windows using the probability of (1). The model trained using the Pre-Clustered
approach performed similarly. From these curves, we see that the TAS model
provided a significant improvement in accuracy for all but the “people” and
“sheep” classes. We believe the lack of improvement for people is due to the
wide variation of backgrounds in these images, including streets, grass, forests,
deserts, etc. With no strong context cues to latch onto, TAS is unable to improve
on the base HOG detector, which was in fact originally optimized to detect
people. For sheep, TAS provides an improvement at the low recall rates only. This
may be due to the wide scale variation present in the sheep dataset.
2 Differences in PR curves between our base detector and the INRIA-Dalal/INRIA-Douze results come from the use of slightly different training windows and parameters. We selected algorithm parameters based on the numbers in Everingham et al. [19] and by looking at results on the “train+val” set. We note that a slight change in parameters raised our “people” results far above the INRIA-Dalal challenge 3 results to the level of their challenge 4 results, where they trained on their own dataset. Also, INRIA-Dalal did not report results for “bicycle.”
Satellite Images. The third dataset is a set of 30 images extracted from
Google Earth. The images are color, and of size 792 × 636, and contain 1319
manually labeled cars. The average car window is approximately 45 × 45 pixels,
and all windows are scaled to these dimensions for training. We used 5-fold
cross-validation, and results below report the mean performance across the folds.
Here, we use a patch-based boosted detector very similar to that of
Torralba [1]. We use 50 rounds of boosting with two-level decision trees over patch
cross-correlation features that were computed for 15,000–20,000 rectangular patches
of various aspect ratios and widths from 4 pixels up to 22. Patches are extracted
from the intensity and gradient magnitude images. We learned a single
detector using every positive window, rather than learning detectors for different car
orientations. As above, we convert the boosting score into a probability using
logistic regression. For training the TAS model, we used 10 random restarts of
EM, selecting the parameters that provided the best likelihood of the observed
data. For inference, we need to account for the fact that our detectors are much
weaker, and so more samples are necessary to adequately capture the posterior.
We utilize 100 samples, where each sample undergoes 5 iterations.
Because the in-plane rotation of these images is arbitrary, we need to use
relationships that are more sophisticated than axis-aligned offsets. We observe
that many regions are elongated along roads in the images. Thus, we can use
the region shape to define a local coordinate system that roughly aligns with the
roads. For every candidate detection window, we find its containing region, determine
the major and minor axis of that region, and re-project the locations of all other
image regions into this new coordinate system. We then encode the relationships
(in the new coordinate system) of IN (R = 1), 100 pixels ABOVE (R = 2),
BELOW (R = 3), LEFT (R = 4), RIGHT (R = 5), and NONE (R = 6) (see footnote 3).
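The local coordinate system can be sketched with second-order moments of the containing region. The text does not say how the major and minor axes are estimated, so the moment-based orientation below, the representation of a region as a list of pixel coordinates, and the function names are assumptions of this example.

```python
import math

def major_axis_angle(pixels):
    """Orientation (radians) of the major axis of a region.

    pixels: list of (x, y) coordinates belonging to the region. The angle
    is that of the principal eigenvector of the 2x2 coordinate covariance.
    """
    n = len(pixels)
    mx = sum(p[0] for p in pixels) / n
    my = sum(p[1] for p in pixels) / n
    sxx = sum((p[0] - mx) ** 2 for p in pixels) / n
    syy = sum((p[1] - my) ** 2 for p in pixels) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pixels) / n
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)

def to_region_frame(point, origin, angle):
    """Express `point` relative to `origin` in axes rotated by `angle`.

    Returns (u, v): u runs along the major axis, v along the minor axis.
    """
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    c, s = math.cos(angle), math.sin(angle)
    return (c * dx + s * dy, -s * dx + c * dy)
```

With the window center as `origin`, the offsets (u, v) of other region locations can then be compared against 100-pixel displacements along each axis to assign the ABOVE, BELOW, LEFT, and RIGHT relationships; which axis counts as "above" is a convention this sketch leaves open.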
Figure 6 shows some clusters that are learned with the context model. Eight
of the ten learned clusters are shown, visualized by presenting 16 of the image
regions that rank highest with respect to P(F | S). These clusters have a clear
interpretation: cluster #4, for instance, represents the roofs of houses and cluster
#6 trees and water regions. With each cluster, we also show the probability that
a candidate window contains a car given that it is IN (R = 1) this region. The
parameters provide interesting insights. Clusters #7 and #8 are road clusters,
and both give a nearby window roughly an 80% chance of being a car. Clusters #1
and #6, however, which represent forest and grass areas, drive the probability of
nearby detections down to 2% or below. Figure 7 shows an example image with the
detections scored by the detector only, and by the TAS model. Many of the false
positives that are not near roads are filtered out by the model.
3 100 pixels represents 2–3 car lengths, which is approximately the range at which road-car and building-car interactions occur.
Cluster #1: P(car | “In”) = 0.02
Cluster #2: P(car | “In”) = 0.79
Cluster #3: P(car | “In”) = 0.23
Cluster #4: P(car | “In”) = 0.12
Cluster #5: P(car | “In”) = 0.78
Cluster #6: P(car | “In”) = 0.01
Cluster #7: P(car | “In”) = 0.79
Cluster #8: P(car | “In”) = 0.81
Fig. 6. Clusters learned by the context model on the satellite image dataset. Each cluster shows 16 of the training image regions that are most likely to be in the cluster based on P(F | S). For each cluster, we also show the probability that a high-scoring window (according to the detector) in a region labeled by that cluster contains a car.
Here, there are many detections per image, so we plot the recall versus the
number of false positives per image (FPPI) in Figure 8. We compare the Base
Detectors to the TAS Model trained using the Full TAS learning method.
We found that the Full TAS learning outperformed the Pre-Clustered TAS
learning, raising the recall rate by an average of 2.2% for any given FPPI, and
a max of 4.7% for 1.2 FPPI. This demonstrates that some of the benefit from
the model comes from the joint learning phase. The curves verify that context
indeed improves our results by filtering out false positives.
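For readers who want to reproduce a recall-versus-FPPI operating point, the following is a minimal sketch. It assumes that each detection has already been marked as true or false against the groundtruth and that each true detection corresponds to a distinct labeled car; the function name and arguments are hypothetical.

```python
def recall_at_fppi(scores, is_true, num_images, num_positives, fppi):
    """Recall at the lowest threshold allowing at most `fppi` false
    positives per image.

    scores        : detection scores pooled over all test images
    is_true       : matching booleans, True for a correct detection
    num_images    : number of test images
    num_positives : total number of groundtruth objects
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    max_fp = fppi * num_images
    tp = fp = 0
    best_tp = 0
    for i in order:  # sweep the threshold from high to low
        if is_true[i]:
            tp += 1
        else:
            fp += 1
            if fp > max_fp:
                break  # this threshold exceeds the false-positive budget
        best_tp = tp
    return best_tp / num_positives
```

Evaluating this function over a range of `fppi` values for two sets of scores gives two curves of the kind compared in Figure 8.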
6 Discussion and Future Directions
In this paper, we have presented the TAS model, a probabilistic framework that
captures the contextual information between “stuff” and “things”, by linking
discriminative detection of objects with unsupervised clustering of image regions.
Importantly, the method does not require extensive labeling of image regions;
standard labeling of object bounding boxes suffices for learning a model of the
appearance of stuff regions and their contextual cues. We have demonstrated
that the TAS model improves the performance even of strong base classifiers,
including one of the top performing detectors in the PASCAL challenge.
The flexibility of the TAS model provides several important benefits. The339 339
model can accommodate almost any choice of object detector that produces a340 340
score for a window that is monotonically increasing with the likelihood of the341 341
object being in the window. It is also flexible to many classes of region features342 342
(a) Base Detectors (b) Region Labels (c) TAS Detections
Fig. 7. An example satellite image, with detections found by the base detector (a), and by the TAS model (c) with a threshold of 0.15. The TAS model successfully filters out many of the false positives that occur far away from roads, but many of the remaining false positives are “in context,” and thus cannot be fixed. Region labels (b) show that houses (in purple) are correctly grouped, and able to provide context for the cars.
Fig. 8. A plot of recall rate vs. false positives per image for the satellite data. The results here are averaged across 5 folds, and show a significant improvement from using TAS over the base detectors. [Plot: x-axis, False Positives Per Image (0 to 160); y-axis, Recall Rate (0.2 to 1); curves, Base Detector and TAS Model.]
coupled with any generative model over these features. For instance, we might
want to pre-cluster the regions into so-called visual words, and then use a
cluster-dependent multinomial distribution over these words [20]. These
characteristics also make the model applicable to a wide range of problems in
object detection.
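To make the visual-words alternative concrete, the following is a minimal sketch of a cluster-dependent multinomial over word counts. It is illustrative only: the function names are hypothetical, the per-cluster parameters are fit from hard cluster assignments with additive smoothing (an assumption made for brevity, whereas the TAS model would learn them jointly with the rest of its parameters), and the multinomial coefficient is dropped because it does not depend on the cluster.

```python
import math
from typing import List, Sequence


def fit_cluster_multinomials(
    word_counts: Sequence[Sequence[int]],
    cluster_ids: Sequence[int],
    num_clusters: int,
    vocab_size: int,
    alpha: float = 1.0,
) -> List[List[float]]:
    """Estimate one multinomial over visual words per region cluster.

    word_counts[i][w] -- how often visual word w occurs in region i
    cluster_ids[i]    -- hard cluster assignment of region i (assumed given)
    alpha             -- additive smoothing, so no word has zero probability
    """
    totals = [[alpha] * vocab_size for _ in range(num_clusters)]
    for counts, cluster in zip(word_counts, cluster_ids):
        for word, n in enumerate(counts):
            totals[cluster][word] += n
    return [[value / sum(row) for value in row] for row in totals]


def region_log_likelihood(counts: Sequence[int], theta: Sequence[float]) -> float:
    """Log-likelihood of a region's word counts under one cluster's multinomial,
    omitting the cluster-independent multinomial coefficient."""
    return sum(n * math.log(p) for n, p in zip(counts, theta) if n > 0)


def most_likely_cluster(counts: Sequence[int], thetas: Sequence[Sequence[float]]) -> int:
    """Index of the cluster whose multinomial best explains the region's words."""
    scores = [region_log_likelihood(counts, theta) for theta in thetas]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Two toy regions over a two-word vocabulary, one region per cluster.
    thetas = fit_cluster_multinomials([[3, 1], [0, 4]], [0, 1], num_clusters=2, vocab_size=2)
    print(thetas)
    print(most_likely_cluster([2, 0], thetas))
```

In this form, the per-cluster word distribution plays the role of the generative model over region features; any other feature likelihood could be substituted for `region_log_likelihood` without changing the surrounding structure.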
Because the image region clusters are learned in an unsupervised fashion,
they are able to capture a wide range of possible concepts. While a human might
label the regions in one way (say trees and buildings), the automatic learning
procedure might find a more contextually relevant grouping. For instance, the
TAS model might split buildings into two categories: apartments, which often
have cars parked near them, and factories, which rarely co-occur with cars.
As discussed in Section 2, recent work has amply demonstrated the importance
of context in computer vision. The type of context modeled by the TAS
framework is a natural complement for many of the other types of context in
the literature. In particular, while many other forms of context can relate known
objects that have been labeled in the data, our model can extract the signals
present in the unlabeled part of the data.
A major limitation of the TAS model is that it captures only 2D context. This
issue also affects our ability to determine the appropriate scale for the contextual
relationships. It would be interesting to integrate a TAS-like definition of context
into an approach that attempts some level of 3D reconstruction, such as the work
of Hoiem et al. [15] or of Saxena et al. [21], allowing us to utilize 3D context
and simultaneously address the issue of scale.
References
[1] Torralba, A. Contextual priming for object detection. IJCV 53, 2003.
[2] Viola, P., Jones, M. Robust real-time face detection. ICCV, 2001.
[3] Shotton, J., Winn, J., Rother, C., Criminisi, A. TextonBoost: Joint appearance,
shape and context modeling for multi-class object recognition and segmentation.
ECCV, 2006.
[4] Forsyth, D.A., Malik, J., Fleck, M.M., Greenspan, H., Leung, T.K., Belongie, S.,
Carson, C., Bregler, C. Finding pictures of objects in large collections of images.
Object Representation in Computer Vision, 1996.
[5] Murphy, K., Torralba, A., Freeman, W. Using the forest to see the tree: a graphical
model relating features, objects and the scenes. NIPS, 2003.
[6] Singhal, A., Luo, J., Zhu, W. Probabilistic spatial context models for scene content
understanding. CVPR, 2003.
[7] Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S. Objects
in context. ICCV, 2007.
[8] Kumar, S., Hebert, M. A hierarchical field framework for unified context-based
classification. ICCV, 2005.
[9] Carbonetto, P., de Freitas, N., Barnard, K. A statistical model for general
contextual object recognition. ECCV, 2004.
[10] Dalal, N., Triggs, B. Histograms of oriented gradients for human detection. CVPR,
2005.
[11] Oliva, A., Torralba, A. The role of context in object recognition. Trends Cogn
Sci, 2007.
[12] Torralba, A., Murphy, K., Freeman, W., Rubin, M. Context-based vision system
for place and object recognition. ICCV, 2003.
[13] Wolf, L., Bileschi, S. A critical view of context. IJCV 69, 2006.
[14] Fink, M., Perona, P. Mutual boosting for contextual inference. NIPS, 2003.
[15] Hoiem, D., Efros, A.A., Hebert, M. Putting objects in perspective. CVPR, 2006.
[16] Ren, X., Malik, J. Learning a classification model for segmentation. ICCV, 2003.
[17] Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum likelihood from incomplete
data via the EM algorithm. JRSS, 1977.
[18] Geman, S., Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian
restoration of images. Readings in computer vision: issues, problems, principles,
and paradigms, 1987, 564–584.
[19] Everingham, M., et al. The 2005 PASCAL visual object classes challenge. MLCW,
2005.
[20] Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D.M., Jordan, M.I.
Matching words and pictures. JMLR 3, 2003.
[21] Saxena, A., Sun, M., Ng, A.Y. Learning 3-D scene structure from a single still
image. ICCV, 2007.