Learning Spatial Context: Using Stuff to Find Things

Geremy Heitz, Daphne Koller

Department of Computer Science
Stanford University
Stanford, CA 94305

Abstract. The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a defined region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally classified based on texture or color. In this paper, we seek to combine recognition of these two types of objects into a system that leverages “context” toward improving detection. In particular, we cluster regions of the image based on their ability to serve as context for the detection of objects. Rather than providing an explicit training set with region labels, our method automatically groups regions based on both their appearance and their relationships to the detections in the image. We show that our things and stuff (TAS) context model produces meaningful clusters that are readily interpretable, and helps improve our detection ability over state-of-the-art detectors. We present results on object detection in images from the PASCAL VOC 2005/2006 datasets and on the task of overhead car detection in satellite images.

1 Introduction

Recognizing objects in an image requires combining many different signals from the raw image data. Figure 1 shows an example satellite image of a street scene, where we may want to identify all of the cars. From a human perspective, there are two primary signals that we leverage. The first is the local appearance in the window near the potential car. The second is our knowledge that cars appear on roads. This second signal is a form of contextual knowledge, and our goal in this paper is to capture this idea in a rigorous probabilistic model.

Recent papers have demonstrated that boosted object detectors can be effective at detecting monolithic object classes, such as cars [1] or faces [2]. Similarly, several works on multiclass segmentation have shown that regions in images can effectively be classified based on their color or texture properties [3]. These two lines of work have made significant progress on the problems of identifying “things” and “stuff,” respectively. The important differentiation between these two classes of visual objects is summarized in Forsyth et al. [4] as:

The distinction between materials — “stuff” — and objects — “things” — is particularly important. A material is defined by a homogeneous or repetitive pattern of fine-scale properties, but has no specific or distinctive spatial extent or shape. An object has a specific size and shape.


Fig. 1. (Left) An aerial photograph. (Center) Detected cars in the image (solid green = true detections, dashed red = false detections). (Right) Finding “stuff” such as buildings, by classifying regions, shown delineated by red boundaries.

Recent work has also shown that classifiers targeting particular classes of things or stuff can benefit from the proper use of contextual cues. The use of context can be split into a number of categories. Scene-Thing context allows scene-level information, such as the scale or the “gist” [5], to determine prior location probabilities for the various objects. Stuff-Stuff context captures the notion that sky occurs above sea and road below building [6]. Thing-Thing context considers the co-occurrence of objects, such as the notion that a tennis racket is more likely to occur with a tennis ball than with a lemon [7]. Finally, Stuff-Thing context allows the texture regions (e.g., the roads and buildings in Figure 1) to add predictive power to the detection of objects (e.g., the cars in Figure 1). We focus on this fourth type of context. Figure 2 shows an example of this context in the case of satellite imagery.

In this paper, we present a probabilistic model linking the detection of things to the unsupervised classification of stuff. Our method can be viewed as an attempt to cluster “stuff,” represented by coherent image regions, into clusters that are both visually similar and best able to provide context for the detectable “things” in the image. Cluster labels for the image regions are probabilistically linked to the detection window labels through the use of region-detection “relationships,” which encode their relative spatial locations. We call this model the things and stuff (TAS) context model because it links these two components into a coherent whole. The graphical representation of the TAS model is depicted in Figure 3. At training time, this model leverages supervised (groundtruth) detection labels to learn parameters using the Expectation-Maximization (EM) algorithm. For test instances, both the region labels and detection labels are inferred from the model.

We present results of the TAS method on diverse datasets. Using the PASCAL Visual Object Classes challenge datasets from 2005 and 2006 (VOC2005 and VOC2006), we utilize one of the top competitors as our baseline detector [10], and demonstrate that our TAS method improves the performance of detecting cars, bicycles and motorbikes in street scenes and cows and sheep in rural scenes (see Figure 4). In addition, we consider a very different dataset of satellite images from Google Earth, of the city and suburbs of Brussels, Belgium. The task here is to identify the cars from above; see Figure 2. For clarity, the satellite data is used as the running example throughout the paper, but all descriptions apply equally to the other datasets.


Fig. 2. Example detections from the satellite dataset that demonstrate context. Classifying using local appearance only, we might think that both windows at left are cars. However, when seen in context, the bottom detection is unlikely to be an actual car.

2 Related Work

The role of context in object recognition has become an important topic, due both to the psychological basis of context in the human visual system [11] and to the striking algorithmic improvements that “visual context” has provided [12].

The word “context” has been attached to many different ideas. One of the simplest forms is co-occurrence context. The work of Rabinovich et al. [7] demonstrates the use of this context, where the presence of a certain object class in an image probabilistically influences the presence of a second class. The context of Torralba et al. [12] assumes that certain objects occur more frequently in certain rooms, as monitors tend to occur in offices. While these methods achieve excellent results when many different object classes are labeled per image, they are unable to leverage unsupervised data for contextual object recognition.

In addition to co-occurrence context, many approaches take into account the spatial relationships between objects. At the local descriptor level, Wolf et al. [13] detect objects using a descriptor with a large capture range, allowing the detection of the object to be influenced by surrounding image features. Because these methods use only the raw features for context, however, they cannot obtain a holistic view of an entire scene. Similarly, Fink and Perona [14] use the output of boosted detectors for other classes as additional features for the detection of a given class. This allows the inclusion of signal beyond the raw features, but requires that all “parts” of a scene be supervised. In contrast, Murphy et al. [5] use a global feature known as the “gist” to learn statistical priors on the locations of various objects within the context of the specific scene. The gist descriptor is excellent at predicting the scene type and large structures in the scene, but cannot handle the local interactions present in the satellite data, for example.

Another approach to modeling spatial relationships is to use a Markov Random Field (MRF) or variant (CRF, DRF) [8, 9] to encode the preferences for certain spatial relationships. These techniques offer a great deal of flexibility in the formulation of the affinity function and all the standard benefits of a graphical model formulation (e.g., well-known learning and inference techniques). Singhal et al. [6] also use similar concepts to the MRF formulation to aggregate decisions across the image. These methods, however, suffer from two drawbacks. First, they tend to require a large amount of annotation in the training set. Second, they put things and stuff on the same footing, representing both as “sites” in the MRF. Our method requires less annotation and allows detections and image regions to be represented in their (different) natural spaces.

Perhaps the most ambitious attempts to use context involve modeling the scene of an image holistically. Torralba [1], for instance, uses global image features to “prime” the detector with the likely presence/absence of objects, the likely locations, and the likely scales. The work of Hoiem and Efros [15] takes this one level further by explicitly modeling the 3D layout of the scene. This allows the natural use of scale and location constraints (e.g., things closer to the camera are larger). Their approach, however, is tailored to street scenes, and requires domain-specific modeling. The specific form of their priors would be useless in the case of satellite images, for example.

3 Things and Stuff (TAS) Context Model

Our probabilistic context model builds on two standard components that are commonly used in the literature. The first is sliding window object detection, and the second is unsupervised image region clustering. A common method for finding “things” is to slide a window over the image, score each window’s match to the model, and return the highest matching such windows. We denote the features in the $i$th candidate window by $W_i$, the presence/absence of the target class in that window by $T_i$ ($T$ for “thing”), and assume that what we learn in our detector is a conditional probability $P(T_i \mid W_i)$ from the window features to the probability that the window contains the object; this probability can be derived from most standard classifiers, such as the highly successful boosting approaches [1]. The set of windows included in the model can, in principle, include all windows in the image. We, however, limit ourselves to looking at the top scoring detection windows according to the detector (i.e., all windows above some low threshold, chosen in order to capture most of the true positives).

The second component we build on involves clustering coherent regions of the image into groups based on appearance. We begin by segmenting the image into regions, known as superpixels, using the normalized cut algorithm of Ren and Malik [16]. For each region $j$, we extract a feature vector $F_j$ that includes color and texture features. For our stuff model, we will use a generative model where each region has a hidden class, and the features are generated from a Gaussian distribution with parameters depending on the class. Denote by $S_j$ ($S$ for “stuff”) the (hidden) class label of the $j$th region. We assume that the features are derived from a standard naive Bayes model, where we have a probability distribution $P(F_j \mid S_j)$ of the image features given the class label.

In order to relate the detector for “things” ($T$’s) to the clustering model over “stuff” ($S$’s), we need to develop an intuitive representation for the relationships between these units. Human knowledge in this area comes in sentences like “cars drive on roads,” or “cars park 20 feet from buildings.” We introduce variables $R_{ij}$ that represent the relationship between candidate detection $i$ and region $j$. The values of $R_{ij}$ indicate different relationships, such as: “detection $i$ is in region $j$”, or “detection $i$ is about 100 pixels away from region $j$”.

We can now link our component models into a single coherent probabilistic things and stuff (TAS) model, as depicted in the plate model of Figure 3(a). Probabilistic influence flows between the detection window labels and the image region labels through the v-structures that are activated by observing the relationship variables. For a particular input image, this plate model unrolls into a “ground” network that encodes a distribution over the detections and regions of the image. Figure 3(b) shows a toy example of a ground network for the image of Figure 1. It is interesting to note the similarities between TAS and the MRF approaches in the literature. In effect, the relationship variables link the detections and the regions into a probabilistic web where signals at one point in the web (say the strong appearance of a particular region) influence a detection in another part of the image, which in turn influences the label of a different region in yet another location. In the example of Figure 3(b), if region 3 has strong road appearance, it might influence $T_1$ through the relationship $R_{13}$. This in turn will influence the labeling of region 1 through the relationship $R_{11}$. Because our method uses a generative model for the region labels, training can be performed very effectively, even when the $S_j$ variables are unobserved (see below).

Fig. 3. The TAS model. The plate representation (a) gives a compact visualization of the model, which unrolls into a “ground” network (b) for any particular image. [Figure omitted. (a) TAS plate model: nodes $W_i$ (window), $T_i$ (object presence), $S_j$ (region label), $F_j$ (region features), and $R_{ij}$ (relationship), with plates over the $N$ windows and the $J$ regions of an image. (b) TAS ground network: candidate windows $T_1$ to $T_3$ and image regions $S_1$ to $S_5$, with example relationships $R_{11}$ = “Left”, $R_{13}$ = “In”, $R_{21}$ = “Above”, $R_{31}$ = “Left”, $R_{33}$ = “In”.]

All variables in the TAS model are discrete except for the feature variables $F_j$. This allows for simple table conditional probability distributions (CPDs) for all discrete nodes in this Bayesian network. The probability distribution over these variables decomposes according to:

$$P(\mathbf{T}, \mathbf{S}, \mathbf{F}, \mathbf{R}, \mathbf{W}) = \prod_i P(W_i)\, P(T_i \mid W_i) \prod_j P(S_j)\, P(F_j \mid S_j) \prod_{ij} P(R_{ij} \mid T_i, S_j).$$

One of the main benefits of the TAS approach is its simplicity and modularity. It allows us to “plug in” any sliding window detector and any generative approach for region clustering (e.g., [3]).
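As a rough illustration of this factorization, the sketch below evaluates the log of the joint for one fully assigned ground network. It is not the authors' implementation: the function name, the array layout, and the zero-based integer encoding of $T_i$, $S_j$, and $R_{ij}$ are all assumptions made for the example, and each factor is assumed to have been precomputed as a table of log-probabilities.

```python
import numpy as np

def tas_log_joint(T, S, R, log_p_w, log_p_t_given_w,
                  log_p_s, log_p_f_given_s, log_p_r):
    """Log of the factorized TAS joint for one image (illustrative layout).

    T: (N,) ints in {0, 1}            window labels
    S: (J,) ints in {0..K-1}          region cluster labels
    R: (N, J) ints in {0..M-1}        relationship codes (zero-based here)
    log_p_w: (N,)                     log P(W_i)
    log_p_t_given_w: (N, 2)           log P(T_i = t | W_i)
    log_p_s: (K,)                     log P(S_j = k)
    log_p_f_given_s: (J, K)           log P(F_j | S_j = k)
    log_p_r: (M, 2, K)                log P(R_ij = r | T_i = t, S_j = k)
    """
    n, j = R.shape
    total = log_p_w.sum()
    total += log_p_t_given_w[np.arange(n), T].sum()
    total += log_p_s[S].sum()
    total += log_p_f_given_s[np.arange(j), S].sum()
    # Broadcast T over columns and S over rows to index every (i, j) pair.
    total += log_p_r[R, T[:, None], S[None, :]].sum()
    return total
```

Because the windows are always observed, the $P(W_i)$ term is a constant with respect to the hidden variables and can be set to zero in log space without affecting inference.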


4 Learning and Inference in the TAS Model

Because TAS unrolls into a Bayesian network for each image, we can use standard learning and inference methods. In particular, we learn the parameters of our model using the Expectation-Maximization (EM) [17] algorithm and perform inference using Gibbs sampling, a standard variant of MCMC sampling [18].

Learning the Model with EM. At learning time, we have a set of images with annotated labels for our target object class(es). We first train the base detectors using this set, and select as our candidate windows all detections above a threshold; we thus obtain a set of candidate detection windows $W_1, \ldots, W_N$ along with their groundtruth labels $T_1, \ldots, T_N$. We also have a set of regions and a feature vector $F_j$ for each, and an $R_{ij}$ relationship variable for every window-region pair. Our goal is to learn parameters for the TAS model.

One option is to learn in two phases: we first use an existing clustering method (such as K-means or Mixture-of-Gaussian clustering) to assign the $S_j$ variables in the training data; we then learn parameters for the TAS model with fully observed data. This approach is attractive as it allows the first part of the learning to be fully unsupervised, enabling us to use more images than exist in our labeled dataset. We call the model learned with this method Pre-Clustered TAS. While the resulting clusters are likely to be visually coherent, they do not exploit the spatial relationships between the regions and the object detections.

We therefore propose a joint training method where we perform EM in the full model; this model is called the Full TAS model. The EM algorithm iterates between using probabilistic inference to derive a soft completion of the hidden variables (E-step) and finding maximum-likelihood parameters relative to this soft completion (M-step). The E-step here is particularly easy: at training time, only the $S_j$’s are unobserved; moreover, because the $T$ variables are observed, the $S_j$’s are conditionally independent of each other. Thus, the inference step turns into a simple computation for each $S_j$ separately, a process which can be performed in linear time. The M-step for table CPDs can be performed easily in closed form. To provide a good starting point for EM, we initialize the cluster assignments using the K-means algorithm. EM is guaranteed to converge to a local maximum of the likelihood function of the observed data.
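The per-region E-step and the closed-form table update can be sketched as follows. This is a minimal illustration under assumed array shapes, not the authors' code: `e_step`, `m_step_relationship`, and the smoothing pseudo-count `alpha` are hypothetical names and choices, and only the relationship CPD $P(R_{ij} \mid T_i, S_j)$ is updated here (the prior over $S_j$ and the Gaussian feature models would be updated analogously from the same soft assignments).

```python
import numpy as np

def e_step(T, R, log_p_s, log_p_f_given_s, log_p_r):
    """Soft assignments q[j, k] = P(S_j = k | T, F_j, R) with T observed.

    T: (N,) ints in {0, 1}; R: (N, J) ints in {0..M-1}
    log_p_s: (K,); log_p_f_given_s: (J, K); log_p_r: (M, 2, K)
    """
    # log_p_r[R, T[:, None], :] has shape (N, J, K); sum over windows i.
    rel = log_p_r[R, T[:, None], :].sum(axis=0)
    logq = log_p_s[None, :] + log_p_f_given_s + rel
    logq -= logq.max(axis=1, keepdims=True)      # numerical stability
    q = np.exp(logq)
    return q / q.sum(axis=1, keepdims=True)

def m_step_relationship(T, R, q, n_rel, alpha=1.0):
    """Re-estimate the table P(R | T, S) from expected counts.

    alpha is an assumed smoothing pseudo-count, not taken from the paper.
    """
    k = q.shape[1]
    counts = np.full((n_rel, 2, k), alpha)
    for i in range(R.shape[0]):
        for j in range(R.shape[1]):
            counts[R[i, j], T[i], :] += q[j]
    return counts / counts.sum(axis=0, keepdims=True)
```

Each region's posterior touches only its own column of relationships, which is why the E-step cost is linear in the number of window-region pairs.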

Inference with Gibbs Sampling. At test time, our system must determine which windows in a new image contain the target object. We observe the candidate detection windows ($W_i$’s, extracted by thresholding the base detector output), the features of each image region ($F_j$’s), and the relationships ($R_{ij}$’s). Our task is to find the probability that each window contains the object:

$$P(\mathbf{T} \mid \mathbf{F}, \mathbf{R}, \mathbf{W}) = \sum_{\mathbf{S}} P(\mathbf{T}, \mathbf{S} \mid \mathbf{F}, \mathbf{R}, \mathbf{W}) \tag{1}$$

Unfortunately, this expression involves a summation over an exponential set of values for the $\mathbf{S}$ vector of variables. We solve the inference problem approximately using a Gibbs sampling [18] MCMC method. We begin with some assignment to the variables. Then, in each Gibbs iteration we first resample all of the $S$’s and then resample all the $T$’s according to the following two probabilities:

$$P(S_j \mid \mathbf{T}, \mathbf{F}, \mathbf{R}, \mathbf{W}) \propto P(S_j)\, P(F_j \mid S_j) \prod_i P(R_{ij} \mid T_i, S_j) \tag{2}$$

$$P(T_i \mid \mathbf{S}, \mathbf{F}, \mathbf{R}, \mathbf{W}) \propto P(T_i \mid W_i) \prod_j P(R_{ij} \mid T_i, S_j) \tag{3}$$

These sampling steps can be performed efficiently, as the $T_i$ variables are conditionally independent given the $S$’s and the $S_j$’s are conditionally independent given the $T$’s. In the last Gibbs iteration for each sample, rather than resampling $T$, we compute the posterior probability over $T$ given our current $S$ samples, and use these distributional particles for our estimate of the probability in (1).
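A compact sketch of this blocked Gibbs scheme is given below. It follows equations (2) and (3) under assumed array layouts; the function names, the zero-based encodings, and the default sample counts are illustrative choices rather than the authors' implementation, and `n_iters` is assumed to be at least 1.

```python
import numpy as np

def _normalize(logp):
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def posterior_s(T, R, log_p_s, log_p_f_given_s, log_p_r):
    """Eq. (2) for every region j; returns an array of shape (J, K)."""
    rel = log_p_r[R, T[:, None], :].sum(axis=0)       # sum over windows
    return _normalize(log_p_s[None, :] + log_p_f_given_s + rel)

def posterior_t(S, R, log_p_t_given_w, log_p_r):
    """Eq. (3) for every window i; returns an array of shape (N, 2)."""
    # Indexing yields shape (N, J, 2); sum over regions j.
    rel = log_p_r[R, :, S[None, :]].sum(axis=1)
    return _normalize(log_p_t_given_w + rel)

def sample_rows(p, rng):
    """Draw one category index per row of a row-stochastic matrix."""
    cdf = np.cumsum(p, axis=1)
    u = rng.random((p.shape[0], 1))
    return (u > cdf).sum(axis=1).clip(max=p.shape[1] - 1)

def gibbs_detection_probs(R, log_p_t_given_w, log_p_s, log_p_f_given_s,
                          log_p_r, n_samples=10, n_iters=5, seed=0):
    """Estimate P(T_i = 1 | F, R, W) by averaging distributional particles."""
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    estimate = np.zeros(n)
    for _ in range(n_samples):
        T = rng.integers(0, 2, size=n)                # random initialization
        for it in range(n_iters):
            S = sample_rows(
                posterior_s(T, R, log_p_s, log_p_f_given_s, log_p_r), rng)
            p_t = posterior_t(S, R, log_p_t_given_w, log_p_r)
            if it < n_iters - 1:
                T = sample_rows(p_t, rng)
        estimate += p_t[:, 1]      # last iteration: keep the posterior itself
    return estimate / n_samples
```

Averaging the final posteriors rather than hard samples of $T$ reduces the variance of the estimate for a fixed number of chains.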

5 Experimental Results

In order to evaluate the ability of the TAS model to learn and utilize context, we perform experiments on three datasets that differ along several axes. The first two datasets are from the PASCAL Visual Object Classes challenges 2005 and 2006 [19]. The scenes are urban and rural, indoor and outdoor, and there is a great deal of scale and shape variation amongst the objects. The third is a set of satellite images acquired from Google Earth. The goal in these images is to detect cars from above. Because of the impoverished visual information, there are many false positives when a sliding window detector is applied. In this case, context provides a filtering mechanism to remove the false positives. Because these two applications are different, and in order to demonstrate the flexibility of our approach, we use different detectors for each. In all experiments, we allow $S$ a cardinality of $K = 10$,¹ and use 42 features for the image regions that represent color, texture, and shape [20]. For clusters, we keep a mean and full covariance matrix over these features. A small regularization ($10^{-6} I$) is added to the covariance to ensure positive semi-definiteness.
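The cluster model described here (a mean, a full covariance, and a small diagonal regularizer) can be sketched as below. The weighted-fit formulation and the function names are assumptions made for illustration; the weights would be the soft cluster assignments in an EM setting. Adding a positive multiple of the identity to a positive semi-definite matrix makes it strictly positive definite, which is what allows the Cholesky factorization in the likelihood to succeed even for degenerate clusters.

```python
import numpy as np

def fit_cluster(features, weights, reg=1e-6):
    """Weighted mean and full covariance for one cluster, plus reg * I.

    features: (n, d) region feature vectors; weights: (n,) nonnegative.
    """
    w = weights / weights.sum()
    mean = w @ features
    centered = features - mean
    cov = (centered * w[:, None]).T @ centered
    return mean, cov + reg * np.eye(features.shape[1])

def gaussian_log_likelihood(features, mean, cov):
    """log N(F_j; mean, cov) for each row of features."""
    d = features.shape[1]
    centered = features - mean
    chol = np.linalg.cholesky(cov)                 # cov = chol @ chol.T
    z = np.linalg.solve(chol, centered.T)          # whitened residuals
    maha = (z ** 2).sum(axis=0)                    # squared Mahalanobis distance
    log_det = 2.0 * np.log(np.diag(chol)).sum()
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)
```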

PASCAL VOC Datasets. For these experiments, we used four classes from the VOC2005 data, and two classes from the VOC2006 data. The VOC2005 dataset consists of 2232 images, manually annotated with bounding boxes for four image classes: cars, people, motorbikes, and bicycles. We use the “train+val” set (684 images) for training, and the “test2” set (859 images) for testing. The VOC2006 dataset contains 5304 images, manually annotated with 12 classes, of which we use the cow and sheep classes. We train on the “trainval” set (2618 images) and test on the “test” set (2686 images). To compare with the results of the challenges, we adopted as our detector the HOG (histogram of oriented gradients) detector of Dalal and Triggs [10]. This detector uses an SVM and therefore outputs a score $\text{margin}_i \in (-\infty, +\infty)$, which we convert into a probability by learning a logistic regression function for $P(T_i \mid \text{margin}_i)$. We also plot the precision-recall curve using the code provided in the challenge toolkit.
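The margin-to-probability conversion can be illustrated with a one-dimensional logistic regression. The sketch below fits $P(T = 1 \mid \text{margin}) = \sigma(a \cdot \text{margin} + b)$ by plain gradient ascent on the log-likelihood; the optimizer, learning rate, step count, and function names are assumptions chosen for brevity, and the paper does not specify how the logistic function was fit.

```python
import numpy as np

def fit_margin_to_probability(margins, labels, lr=0.1, n_steps=2000):
    """Fit sigmoid(a * margin + b) to binary labels by gradient ascent."""
    a, b = 1.0, 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(a * margins + b)))
        # Gradient of the mean log-likelihood with respect to a and b.
        a += lr * np.mean((labels - p) * margins)
        b += lr * np.mean(labels - p)
    return a, b

def margin_to_probability(margins, a, b):
    """Map raw detector margins to P(T = 1 | margin)."""
    return 1.0 / (1.0 + np.exp(-(a * margins + b)))
```

The fitted map is monotone in the margin whenever `a` is positive, so it changes the calibration of the scores but not their ranking.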

Because these images are taken parallel to the ground plane, we use relationships that capture axis-aligned interactions at distances relative to the size of the candidate detected object. We therefore use the following relationships: $R_{ij} = 1$ (IN) for the region $j$ closest to the center of the detection window; $R_{ij} = 2, 3, 4, 5$ for the regions that are one bounding box width to the LEFT of, to the RIGHT of, ABOVE, and BELOW the window; $R_{ij} = 6$ (NONE) for all other regions.

¹ Results were robust to a range of $K$ between 5 and 20.

Fig. 4. (a,b) Example training detections from the bicycle class. The related image regions are colored by their most likely cluster label. In both examples, the blue region that is horizontally offset from the detection belongs to cluster #3. (c) 16 of the top scoring regions for cluster #3, ranked by $P(F \mid S = 3)$ (likelihood of image features). This cluster corresponds to “roads” or “bushes” as things that are gray/green and occur next to cars. (d) A case where context helped find a true detection. (e,f) Two examples where incorrect detections are filtered out by context.
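One plausible way to compute the axis-aligned relationship codes (IN, LEFT, RIGHT, ABOVE, BELOW, NONE) for a single candidate window is sketched below. Several details are assumptions, since the text does not pin them down: regions are given as an integer label map, the IN region is taken to be the one containing the window center, each directional region is found by probing a point one window width away from the center, and IN takes priority if one region is hit by more than one probe.

```python
import numpy as np

IN, LEFT, RIGHT, ABOVE, BELOW, NONE = 1, 2, 3, 4, 5, 6

def axis_aligned_relationships(box, segments):
    """Return {region_id: relationship code} for one candidate window.

    box: (x0, y0, x1, y1) in integer pixel coordinates, y increasing downward.
    segments: (H, W) integer array of region ids (e.g. superpixel labels).
    """
    h, w = segments.shape
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    width = x1 - x0
    rel = {int(r): NONE for r in np.unique(segments)}
    # IN is listed last so that it overrides a directional code.
    probes = [(LEFT, cx - width, cy), (RIGHT, cx + width, cy),
              (ABOVE, cx, cy - width), (BELOW, cx, cy + width),
              (IN, cx, cy)]
    for code, px, py in probes:
        if 0 <= px < w and 0 <= py < h:      # skip probes outside the image
            rel[int(segments[py, px])] = code
    return rel
```

For example, on a 9 × 9 label map made of nine 3 × 3 blocks, a window covering the central block is IN that block and has its four neighbors as LEFT, RIGHT, ABOVE, and BELOW, with the corner blocks left as NONE.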

Figure 4 (top row) shows example bicycle detection candidates, and the relationships they have to the regions in their images. These examples suggest the type of context that might be learned. For example, the region beside both detections (colored blue) belongs to cluster #3, which looks visually like a road or bush cluster. The learned values of the model parameters also indicate that being to the left or right of this cluster increases the probability of a window containing a bicycle (e.g., by about 33% in the case where $R_{ij}$ is RIGHT).

We performed a single run of EM learning to convergence, which takes around 2 hours on an Intel Dual Core 1.9 GHz machine with 2 GB of memory. We run separate experiments for each class, though in principle it would be possible to learn a single joint model over all classes. By separating the classes, we are able to isolate the contextual contribution from the stuff, rather than from interactions between the different types of things present in the images. For our MCMC inference, we found that, due to the strength of the baseline detectors, the Markov chain converged fairly rapidly; we achieved very good results using merely 10 MCMC samples, where each is initialized randomly and then undergoes 5 Gibbs iterations. The entire inference process takes about one second per image.

Fig. 5. Precision-recall curves for the VOC2005 and VOC2006 classes. [Plots omitted. Panels: (a) Cars (2005), (b) Motorbikes (2005), (c) People (2005), (d) Bicycles (2005), (e) Cows (2006), (f) Sheep (2006). Axes: recall rate (horizontal) and precision (vertical). Legend: TAS Model, Base Detectors, and INRIA-Dalal (2005 panels) or INRIA-Douze (2006 panels).]

The bottom row of Figure 4 shows some detections that were corrected using context. We show one example where a true bicycle was discovered using context, and two examples where false positives were filtered out by our model. These examples demonstrate the type of information that is being leveraged by TAS. In the first example, the sky above, and the road beside the window give a signal that this detection is at ground level, and is therefore likely to be a bicycle.

Figure 5 shows the full recall-precision curve for each class. For (a-d) we compare to the 2005 INRIA-Dalal challenge entry, and for (e,f) we compare to the 2006 INRIA-Douze entry, both of which used the HOG detector. We also show the curve produced by our Base Detector alone.² Finally, we plot the curves produced by our TAS Model, trained using full EM, which scores windows using the probability of (1). The model trained using the Pre-Clustered approach performed similarly. From these curves, we see that the TAS model provided a significant improvement in accuracy for all but the “people” and “sheep” classes. We believe the lack of improvement for people is due to the wide variation of backgrounds in these images, including streets, grass, forests, deserts, etc. With no strong context cues to latch onto, TAS is unable to improve on the base HOG detector, which was in fact originally optimized to detect people. For sheep, TAS provides an improvement at the low recall rates only. This may be due to the wide scale variation present in the sheep dataset.

² Differences in PR curves between our base detector and the INRIA-Dalal/INRIA-Douze results come from the use of slightly different training windows and parameters. We selected algorithm parameters based on the numbers in Everingham et al. [19] and by looking at results on the “train+val” set. We note that a slight change in parameters raised our “people” results far above the INRIA-Dalal challenge 3 results to the level of their challenge 4 results, where they trained on their own dataset. Also, INRIA-Dalal did not report results for “bicycle.”

Satellite Images. The third dataset is a set of 30 images extracted from Google Earth. The images are color, and of size 792 × 636, and contain 1319 manually labeled cars. The average car window is approximately 45 × 45 pixels, and all windows are scaled to these dimensions for training. We used 5-fold cross-validation, and results below report the mean performance across the folds.

Here, we use a patch-based boosted detector very similar to that of Torralba [1]. We use 50 rounds of boosting with two-level decision trees over patch cross-correlation features that were computed for 15,000–20,000 rectangular patches of various aspect ratios and widths from 4 pixels up to 22. Patches are extracted from the intensity and gradient magnitude images. We learned a single detector using every positive window, rather than learning detectors for different car orientations. As above, we convert the boosting score into a probability using logistic regression. For training the TAS model, we used 10 random restarts of EM, selecting the parameters that provided the best likelihood of the observed data. For inference, we need to account for the fact that our detectors are much weaker, and so more samples are necessary to adequately capture the posterior. We utilize 100 samples, where each sample undergoes 5 iterations.

Because the in-plane rotation of these images is arbitrary, we need to use relationships that are more sophisticated than axis-aligned offsets. We observe that many regions are elongated along roads in the images. Thus, we can use the region shape to define a local coordinate system that roughly aligns with the roads. For every candidate detection window, we find its containing region, determine the major and minor axis of that region, and re-project the locations of all other image regions into this new coordinate system. We then encode the relationships (in the new coordinate system) of IN ($R = 1$), 100 pixels ABOVE ($R = 2$), BELOW ($R = 3$), LEFT ($R = 4$), RIGHT ($R = 5$), and NONE ($R = 6$).³
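The local coordinate system can be sketched with a principal-axis computation. The text does not say how the major and minor axes are obtained; the version below assumes they are the eigenvectors of the covariance of the region's pixel coordinates, and the function names are hypothetical. Note that the sign of each axis is arbitrary, so an implementation would need a convention to distinguish ABOVE from BELOW and LEFT from RIGHT.

```python
import numpy as np

def region_axes(pixel_xy):
    """Unit vectors along the major and minor axes of a region.

    pixel_xy: (n, 2) array of the (x, y) coordinates of the region's pixels.
    """
    centered = pixel_xy - pixel_xy.mean(axis=0)
    cov = centered.T @ centered / len(pixel_xy)
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    return eigvecs[:, 1], eigvecs[:, 0]        # (major, minor)

def to_local_frame(points_xy, origin_xy, major, minor):
    """Coordinates of points in the frame centered at origin_xy."""
    offsets = points_xy - origin_xy
    return np.stack([offsets @ major, offsets @ minor], axis=1)
```

With region centroids re-expressed through `to_local_frame`, a relationship such as “100 pixels ABOVE” reduces to a test on the two local coordinates.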

Figure 6 shows some clusters that are learned with the context model. Eight of the ten learned clusters are shown, visualized by presenting 16 of the image regions that rank highest with respect to $P(F \mid S)$. These clusters have a clear interpretation: cluster #4, for instance, represents the roofs of houses and cluster #6 trees and water regions. With each cluster, we also show the probability that a candidate window contains a car given that it is IN ($R = 1$) this region. The parameters provide interesting insights. Clusters #7 and #8 are road clusters, and both give a nearby window roughly an 80% chance of being a car. Clusters #1 and #6, however, which represent forest and grass areas, drive the probability of nearby detections down to 2% or below. Figure 7 shows an example image with the detections scored by the detector only, and by the TAS model. Many of the false positives that are not near roads are filtered out by the model.

³ 100 pixels represents 2–3 car lengths, which is approximately the range at which road-car and building-car interactions occur.


Cluster #1: P(car | “In”) = 0.02
Cluster #2: P(car | “In”) = 0.79
Cluster #3: P(car | “In”) = 0.23
Cluster #4: P(car | “In”) = 0.12
Cluster #5: P(car | “In”) = 0.78
Cluster #6: P(car | “In”) = 0.01
Cluster #7: P(car | “In”) = 0.79
Cluster #8: P(car | “In”) = 0.81

Fig. 6. Clusters learned by the context model on the satellite image dataset. Each cluster shows 16 of the training image regions that are most likely to be in the cluster based on $P(F \mid S)$. For each cluster, we also show the probability that a high-scoring window (according to the detector) in a region labeled by that cluster contains a car.

Here, there are many detections per image, so we plot the recall versus the number of false detections per image in Figure 8. We compare the Base Detectors to the TAS Model trained using the Full TAS learning method.

We found that the Full TAS learning outperformed the Pre-Clustered TAS learning, raising the recall rate by an average of 2.2% across false-positives-per-image (FPPI) levels, with a maximum gain of 4.7% at 1.2 FPPI. This demonstrates that some of the benefit from the model comes from the joint learning phase. The curves verify that context indeed improves our results by filtering out false positives.
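A curve of recall versus false positives per image can be traced by sweeping the score threshold from high to low. The following is a minimal sketch of that computation, not the evaluation code used for Figure 8. It assumes each detection has already been matched against ground truth, so the `is_true_positive` flag, the function name, and the input format are all hypothetical.

```python
def recall_vs_fppi(detections, num_positives, num_images):
    """Return a list of (fppi, recall) points, one per detection.

    detections: (score, is_true_positive) pairs pooled over all images.
    num_positives: total number of ground-truth objects.
    num_images: number of images the detections were pooled from.
    """
    # Lowering the threshold admits detections in order of decreasing score.
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    false_pos = 0
    curve = []
    for _score, is_true_positive in ordered:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        curve.append((false_pos / num_images, true_pos / num_positives))
    return curve


# Toy input: four detections pooled from two images containing two cars.
toy = [(0.9, True), (0.8, False), (0.7, True), (0.3, False)]
curve = recall_vs_fppi(toy, num_positives=2, num_images=2)
for fppi, recall in curve:
    print(f"FPPI={fppi:.1f} recall={recall:.1f}")
```

Comparing two such curves at a fixed FPPI value gives the kind of recall difference reported above.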

6 Discussion and Future Directions

In this paper, we have presented the TAS model, a probabilistic framework that captures the contextual information between “stuff” and “things” by linking discriminative detection of objects with unsupervised clustering of image regions. Importantly, the method does not require extensive labeling of image regions; standard labeling of object bounding boxes suffices for learning a model of the appearance of stuff regions and their contextual cues. We have demonstrated that the TAS model improves the performance even of strong base classifiers, including one of the top performing detectors in the PASCAL challenge.

The flexibility of the TAS model provides several important benefits. The model can accommodate almost any choice of object detector that produces a score for a window that is monotonically increasing with the likelihood of the object being in the window. It is also flexible to many classes of region features


(a) Base Detectors    (b) Region Labels    (c) TAS Detections

Fig. 7. An example satellite image, with detections found by the base detector (a), and by the TAS model (c) with a threshold of 0.15. The TAS model successfully filters out many of the false positives that occur far away from roads, but many of the remaining false positives are “in context,” and thus cannot be fixed. Region labels (b) show that houses (in purple) are correctly grouped, and able to provide context for the cars.

Fig. 8. A plot of recall rate vs. false positives per image for the satellite data. The results here are averaged across 5 folds, and show a significant improvement from using TAS over the base detectors. [Plot: x-axis, False Positives Per Image (0 to 160); y-axis, Recall Rate (0.2 to 1); curves, Base Detector and TAS Model.]

coupled with any generative model over these features. For instance, we might want to pre-cluster the regions into so-called visual words, and then use a cluster-dependent multinomial distribution over these words [20]. These characteristics also make the model applicable to a wide range of problems in object detection.
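To make the visual-word alternative concrete, here is a minimal sketch of a cluster-dependent multinomial over word counts, together with the resulting posterior over clusters for a single region. It is an illustration of the idea rather than part of the TAS implementation. The function names, the toy vocabulary of three words, and the assumption that every word probability is strictly positive (for example, after smoothing) are all hypothetical.

```python
import math


def word_log_likelihood(counts, word_probs):
    """Sum of counts[w] * log(word_probs[w]) over the vocabulary.

    The multinomial coefficient is omitted because it is identical for
    every cluster and cancels when clusters are compared.
    Assumes word_probs[w] > 0 wherever counts[w] > 0.
    """
    return sum(c * math.log(p) for c, p in zip(counts, word_probs) if c > 0)


def cluster_posterior(counts, cluster_word_probs, cluster_prior):
    """Posterior over clusters given one region's visual-word counts."""
    log_joint = [
        math.log(prior) + word_log_likelihood(counts, probs)
        for prior, probs in zip(cluster_prior, cluster_word_probs)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_joint)
    weights = [math.exp(v - top) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example: two clusters over a three-word vocabulary.
cluster_word_probs = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
cluster_prior = [0.5, 0.5]
posterior = cluster_posterior([5, 1, 0], cluster_word_probs, cluster_prior)
print(posterior)
```

In this toy setting, a region dominated by the first word is assigned almost entirely to the first cluster.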

Because the image region clusters are learned in an unsupervised fashion, they are able to capture a wide range of possible concepts. While a human might label the regions in one way (say trees and buildings), the automatic learning procedure might find a more contextually relevant grouping. For instance, the TAS model might split buildings into two categories: apartments, which often have cars parked near them, and factories, which rarely co-occur with cars.

As discussed in Section 2, recent work has amply demonstrated the importance of context in computer vision. The type of context modeled by the TAS framework is a natural complement for many of the other types of context in the literature. In particular, while many other forms of context can relate known objects that have been labeled in the data, our model can extract the signals present in the unlabeled part of the data.


A major limitation of the TAS model is that it captures only 2D context. This issue also affects our ability to determine the appropriate scale for the contextual relationships. It would be interesting to integrate a TAS-like definition of context into an approach that attempts some level of 3D reconstruction, such as the work of Hoiem et al. [15] or of Saxena et al. [21], allowing us to utilize 3D context and simultaneously address the issue of scale.

References

[1] Torralba, A. Contextual priming for object detection. IJCV 53, 2003.
[2] Viola, P., Jones, M. Robust real-time face detection. ICCV, 2001.
[3] Shotton, J., Winn, J., Rother, C., Criminisi, A. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. ECCV, 2006.
[4] Forsyth, D.A., Malik, J., Fleck, M.M., Greenspan, H., Leung, T.K., Belongie, S., Carson, C., Bregler, C. Finding pictures of objects in large collections of images. Object Representation in Computer Vision, 1996.
[5] Murphy, K., Torralba, A., Freeman, W. Using the forest to see the tree: a graphical model relating features, objects and the scenes. NIPS, 2003.
[6] Singhal, A., Luo, J., Zhu, W. Probabilistic spatial context models for scene content understanding. CVPR, 2003.
[7] Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., Belongie, S. Objects in context. ICCV, 2007.
[8] Kumar, S., Hebert, M. A hierarchical field framework for unified context-based classification. ICCV, 2005.
[9] Carbonetto, P., de Freitas, N., Barnard, K. A statistical model for general contextual object recognition. ECCV, 2004.
[10] Dalal, N., Triggs, B. Histograms of oriented gradients for human detection. CVPR, 2005.
[11] Oliva, A., Torralba, A. The role of context in object recognition. Trends Cogn Sci, 2007.
[12] Torralba, A., Murphy, K., Freeman, W., Rubin, M. Context-based vision system for place and object recognition. ICCV, 2003.
[13] Wolf, L., Bileschi, S. A critical view of context. IJCV 69, 2006.
[14] Fink, M., Perona, P. Mutual boosting for contextual inference. NIPS, 2003.
[15] Hoiem, D., Efros, A.A., Hebert, M. Putting objects in perspective. CVPR, 2006.
[16] Ren, X., Malik, J. Learning a classification model for segmentation. ICCV, 2003.
[17] Dempster, A.P., Laird, N.M., Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. JRSS, 1977.
[18] Geman, S., Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Readings in computer vision: issues, problems, principles, and paradigms, 1987, 564–584.
[19] Everingham, M., et al. The 2005 PASCAL visual object classes challenge. MLCW, 2005.
[20] Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D.M., Jordan, M.I. Matching words and pictures. JMLR 3, 2003.
[21] Saxena, A., Sun, M., Ng, A.Y. Learning 3-d scene structure from a single still image. ICCV, 2007.
