
ICCV 2011 Presentation

Description:
Parsing Manhattan Scenes Using Monocular, Stereo, and 3D Features
Transcript
Page 1: ICCV 2011 Presentation

MANHATTAN SCENE UNDERSTANDING USING MONOCULAR, STEREO, AND 3D FEATURES

Alex Flint, David Murray, and Ian Reid
University of Oxford


Page 2: ICCV 2011 Presentation


SEMANTICS IN GEOMETRIC MODELS

1. Motivation

2. Prior work

3. The indoor Manhattan representation

4. Probabilistic model and inference

5. Results and conclusion


Page 3: ICCV 2011 Presentation


MOTIVATION

Single View Computer Vision Multiple View Geometry

[Embedded figure from prior work on indoor scene recognition. Caption (Figure 8): Classified images for a subset of scene categories for the ROI+Gist Segmentation model. Each row corresponds to a scene category; the name above each image is the ground-truth category and the number in parentheses is the classification confidence. The first three columns are the highest-confidence examples, and the remaining five are sampled at equal distances in the classifier's ranking to show which images/classes lie near and far from the decision boundary.]


[Embedded figure: an outdoor scene labelled with the semantic classes Sky, Tree, Rock, Human, Water, Beach, and Sand.]


Page 4: ICCV 2011 Presentation


MOTIVATION

Structure-from-motion does not immediately solve:

• Scene categorisation

• Object recognition

• Many scene understanding tasks

The multiple view setting is increasingly relevant:

• Powerful mobile devices with cameras

• Bandwidth no longer constrains video on the internet

• Depth sensing cameras becoming increasingly prevalent


Page 5: ICCV 2011 Presentation


We seek a representation that:

• leads naturally to semantic-level scene understanding tasks;

• integrates both photometric and geometric data;

• is suitable for both monocular and multiple-view scenarios.

MOTIVATION

The indoor Manhattan representation (Lee et al., 2009)

• Parallel floor and ceiling planes

• Walls terminate at vertical boundaries

• A sub-class of Manhattan scenes

Lee, Kanade, Hebert, “Geometric reasoning for single image structure recovery”, CVPR 2009


Page 6: ICCV 2011 Presentation


Where would a person stand?

Where would doors be found?

What is the direction of gravity?

Is this an office or house?

How wide (in absolute units)?


Page 7: ICCV 2011 Presentation

Goal is to ignore clutter


Page 8: ICCV 2011 Presentation


PRIOR WORK

• Kosecka and Zhang, “Video Compass”, ECCV 2002

• Furukawa, Curless, Seitz, and Szeliski, “Manhattan World Stereo”, CVPR 2009

• Posner, Schroeter, and Newman, “Online generation of scene descriptions in urban environments”, RAS 2008

• Vasudevan, Gachter, Nguyen, Siegwart, “Cognitive maps for mobile robots -- an object-based approach”, RAS 2007

• Bao and Savarese, “Semantic Structure From Motion”, CVPR 2011

[Embedded excerpts and figures from the cited prior work, including: Figure 2.3, semantic labels output by the system of Posner et al.; Figure 2.4, an object-centric map from Vasudevan et al., showing object detections, identified doorways, and the inferred place category for a room; and Figure 6 of Furukawa et al., showing a target image, depth map, depth-normal map, and reconstructed mesh models with and without texture mapping, together with that paper's conclusion on Manhattan-world stereo.]


Page 9: ICCV 2011 Presentation


PRIOR WORK

• Delage, Lee, and Ng, “A dynamic Bayesian network for autonomous 3d reconstruction from a single indoor image”, CVPR 2006

• Hoiem, Efros, and Ebert, “Geometric context from a single image”, CVPR 2005

• Saxena, Sun, and Ng, “Make3d: Learning 3D scene structure from a single still image”, PAMI 2008

• Lee, Kanade, Hebert, “Geometric reasoning for single image structure recovery”, CVPR 2009

[Embedded page excerpts and figures from the cited single-view prior work, including: the introduction of Delage, Lee, and Ng, with Figure 2 (3d reconstruction of a corridor from a single image) and their camera and vanishing-point assumptions; the Make3D abstract and Figure 1 (an original image, its superpixel over-segmentation, the predicted 3-d model, and the textured model); and Figures 13-16 of Lee et al., showing example reconstructions, cases with occluding objects, failure cases, and results on web images.]


Page 10: ICCV 2011 Presentation


Given:

• K views of a scene

• Camera poses from structure-from-motion

• Point cloud

Recover an indoor Manhattan model

PROBLEM STATEMENT


Page 11: ICCV 2011 Presentation


Pre-processing

1. Detect vanishing points

2. Estimate Manhattan homology

3. Vertically rectify images (see the sketch below)
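As an illustration of step 3, here is a minimal Python sketch of one way to build a vertical-rectification homography from intrinsics and the vertical vanishing point, so that world-vertical lines become image-vertical. The slide does not give the exact procedure, so the intrinsics K, the vanishing point v_v, and this particular construction are illustrative assumptions only.

```python
import numpy as np

def vertical_rectification_homography(K, v_vertical):
    """Homography mapping the vertical vanishing point to the point at
    infinity in the image y-direction (a sketch of pre-processing step 3)."""
    # Calibrated direction of the vertical vanishing point.
    d = np.linalg.solve(K, v_vertical)
    d /= np.linalg.norm(d)
    # Rotation taking d onto the camera y-axis (Rodrigues formula);
    # the parallel/anti-parallel corner case is handled trivially here.
    y = np.array([0.0, 1.0, 0.0])
    v = np.cross(d, y)
    s, c = np.linalg.norm(v), float(np.dot(d, y))
    if s < 1e-12:
        R = np.eye(3)
    else:
        vx = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
        R = np.eye(3) + vx + (vx @ vx) * ((1 - c) / s**2)
    # Conjugate back to pixel coordinates.
    return K @ R @ np.linalg.inv(K)

# Hypothetical intrinsics and vanishing point:
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
v_v = np.array([310.0, 5000.0, 1.0])
H_rect = vertical_rectification_homography(K, v_v)
print(H_rect @ v_v)   # proportional to (0, 1, 0): vertical point at infinity
```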

M = \{c_1, (r_1, a_1), \ldots, c_{k-1}, (r_{k-1}, a_{k-1}), c_k\}   (33)

C(M) = \sum_{x=0}^{c_k} \pi(x, y_x) - \sum_{i=0}^{k} \lambda(M, i)   (34)

\hat{M} = \arg\max_M \sum_x \pi(x, y_x) - \sum_{i=0}^{k} \lambda(M, i)   (35)

\lambda(M, i) = \begin{cases} \log(\lambda_1), & \text{if } c_i \text{ is a concave corner} \\ \log(\lambda_2), & \text{if } c_i \text{ is a convex corner} \\ \log(\lambda_3), & \text{if } c_i \text{ is an occluding corner} \end{cases}   (36)

c_k = W   (37)

c_k < W   (38)

\log P(M \mid X) = \overbrace{\sum_x \pi(x, y_x)}^{\text{likelihood}} - \overbrace{\sum_i \lambda(M, i)}^{\text{prior}}   (39)

[1] Flint, Mei, Murray, and Reid, “A Dynamic Programming Approach to Reconstructing Building Interiors”, In ECCV 2010

2. Background

The Manhattan world assumption was introduced by Coughlan and Yuille [?] over a decade ago and has seen increasing attention in the computer vision literature in recent years [?, ?, ?, ?, ?]. Furukawa et al. [?] proposed a Manhattan-world stereo algorithm based on graph cuts. While their approach is concerned with dense photo-realistic reconstructions, ours is intended to capture semantic properties of the scene using a concise representation. The output of their approach, a polygonal mesh, has no immediate semantic interpretation, whereas our models, though less detailed, come packaged with a direct interpretation. A by-product is efficiency: we count computation time in hundreds of milliseconds, whereas Furukawa et al. report waiting more than an hour.

Another approach to interpreting Manhattan worlds is to model scenes as a union of cuboids. This approach has a long history beginning with Roberts' 1965 thesis [?], and has recently been revisited using modern probabilistic techniques [?, ?].

Lee et al. [?] first proposed indoor Manhattan models (a sub-class of general Manhattan models) for monocular reconstructions. They used a branch-and-bound algorithm together with a line-sweep heuristic for approximate inference. Flint et al. [?] employed a similar model but showed a dynamic programming algorithm that performed exact inference in polynomial time. In earlier work [?] Flint et al. also demonstrated Manhattan reconstructions integrated with a SLAM system, but this work inferred models from single frames and then extrapolated these forward in time. In contrast, our work incorporates both multiple view geometry and 3D points directly into a joint inference procedure. We also learn parameters in a Bayesian framework, whereas neither Lee nor Flint utilized training data in any form.

Felzenszwalb and Veksler [?] posed the reconstruction problem in terms of energy minimization, which they showed could be solved using dynamic programming, while Barinova et al. [?] modeled outdoor scenes using a CRF. However, these approaches do not permit strong geometric constraints and so cannot be extended to multiple views.

Semantic scene understanding has, broadly speaking, seen less attention within the multiple view community. The CamVid [?] database of outdoor videos with semantic segmentations is an important and encouraging exception. Brostow et al. [?] showed that simple structure-from-motion cues lead to pleasing segmentations. Sturgess et al. [?] extended this approach to a CRF framework. We compare our method with this approach in section 6.

3. Proposed Model

In this section we describe the indoor Manhattan model. We consider three sensor modalities: monocular image features, stereo features, and 3D point clouds. For each we present a generative model relating observed features to the Manhattan scene structure, which we denote M. For each sensor modality we show that MAP inference can be reduced to maximization over a payoff function \pi(x, y). This allows us to present a unified dynamic programming solution in section 4, which efficiently solves MAP inference for all three sensor modalities.

General Manhattan environments have structural surfaces oriented in three cardinal orientations. Indoor Manhattan environments are a special case that consist of a floor plane, a parallel ceiling plane, and a set of vertical walls extending between them. Each wall extends all the way from floor to ceiling, and walls meet at vertical edges. We always consider environments observed from a camera located between the floor and ceiling. Since each wall extends from floor to ceiling, indoor Manhattan environments always project as a linear chain of walls in the image, as shown in figure ??. Further, the edges at which adjacent walls meet can be categorized as concave, convex, or occluding, as illustrated in figure ?? and discussed further in [?].

We assume that vanishing points for the three Manhattan directions are given. We use the vanishing point detector described by Zhang et al. [?] in the monocular setting and that of [?] in the multiple view setting. It will greatly simplify the remainder of this paper if we can assume that vertical lines in the world appear vertical in the image. To this end we apply the simple rectification procedure of [?].

In (Flint et al., ECCV 2010) we described an exact dynamic programming solution for problems of this form.

Express posterior on models as

Structure recovery


Page 12: ICCV 2011 Presentation


Preliminaries

• The mapping from the ceiling plane to the floor plane, H_{c→f}, is a planar homology.

• Following rectification, H_{c→f} transforms points along image columns.

• Given the label y_x at some column x, the orientation of every pixel in that column can be recovered as follows (see the sketch below):

1. Compute y'_x from (x, y'_x, 1)^T ∝ H (x, y_x, 1)^T

2. Pixels between y_x and y'_x are vertical; the others are horizontal
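A minimal sketch of this per-column orientation recovery, assuming image rows increase downwards and taking H to be the homology that sends a column's floor point to its ceiling point (use the inverse if your H maps the other way). The function name and array layout are illustrative, not from the paper.

```python
import numpy as np

def column_orientations(x, y_x, H, Ny):
    """Per-pixel surface labels for one image column: 'wall' between the
    floor row y_x and its image y'_x under the homology H, 'ceiling' above
    and 'floor' below (assuming y grows downwards in the image)."""
    p = H @ np.array([x, y_x, 1.0])
    y_x_prime = p[1] / p[2]                  # corresponding ceiling row
    y_top, y_bot = sorted((y_x_prime, y_x))
    labels = np.empty(Ny, dtype=object)
    for y in range(Ny):
        if y < y_top:
            labels[y] = "ceiling"
        elif y <= y_bot:
            labels[y] = "wall"
        else:
            labels[y] = "floor"
    return labels
```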

3. Rectify vertical lines. (Section 3.3)
4. Obtain weak orientation estimates. (Section 3.4)
5. Estimate the final model. (Sections 4 and 5)

3.1 Identifying dominant directions

We identify three dominant directions by estimating mutually orthogonal vanishing points in the image. Our approach is similar to Kosecka and Zhang [2], in which k-means clustering provides an initial estimate that is refined using EM. We assume that the vertical direction in the world corresponds to the vanishing point with largest absolute y-coordinate, which we label v_v. The other two vanishing points are denoted v_l and v_r.

If the camera intrinsics are unknown then we construct the camera matrix K from the detected vanishing points by assuming that the camera centre is at the image centre and choosing a focal length and aspect ratio such that the calibrated vanishing points are mutually orthogonal.

3.2 Identifying the floor and ceiling planes

An indoor Manhattan scene has exactly one floor and one ceiling plane, both with normal direction v_v. It will be useful in the following sections to have available the mapping H_{c→f} between the image locations of ceiling points and the image locations of the floor points that are vertically below them (see Figure 1b). H_{c→f} is a planar homology with axis h = v_l × v_r and vertex v_v [15], and can be recovered given the image location of any pair of corresponding floor/ceiling points (x_f, x_c) as

H_{c→f} = I + \mu \frac{v_v h^T}{v_v \cdot h},   (1)

where \mu = < v_v, x_c, x_f, x_c × x_f × h > is the characteristic cross ratio of H_{c→f}.

Although we do not have a priori any such pair (x_f, x_c), we can recover H_{c→f} using the following RANSAC algorithm. First, we sample one point x_c from the region above the horizon in the Canny edge map, then we sample a second point x_f collinear with the first and v_v from the region below the horizon. We compute the hypothesis map H_{c→f} as described above, which we then score by the number of edge pixels that H_{c→f} maps onto other edge pixels (according to the Canny edge map). After repeating this for a fixed number of iterations we return the hypothesis with the greatest score.

Many images contain either no view of the floor or no view of the ceiling. In such cases H_{c→f} is unimportant since there are no corresponding points in the image. If the best H_{c→f} output from the RANSAC process has a score below a threshold k_t then we set \mu to a large value that will transfer all pixels outside the image bounds. H_{c→f} will then have no impact on the estimated model.
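The homology construction of equation (1) and the RANSAC loop described above can be sketched as follows. This is an illustrative reading only, not the authors' implementation: the scale mu is solved from the single correspondence rather than via the cross ratio, the collinearity constraint on the sampled floor point is relaxed, and the array layouts (homogeneous points as rows, a binary Canny edge map) are assumptions.

```python
import numpy as np

def floor_ceiling_homology(v_v, v_l, v_r, x_c, x_f):
    """Planar homology H_{c->f} with vertex v_v and axis h = v_l x v_r,
    fixed by one ceiling/floor correspondence (x_c, x_f); all points are
    homogeneous 3-vectors."""
    h = np.cross(v_l, v_r)                       # horizon line (the axis)
    # Solve H x_c ~ x_f with H = I + mu * (v_v h^T) / (v_v . h), i.e.
    # x_f ~ x_c + t * v_v  where  t = mu * (h . x_c) / (v_v . h).
    a = np.cross(x_f, v_v)
    b = np.cross(x_f, x_c)
    t = -float(a @ b) / float(a @ a)             # least-squares collinearity
    mu = t * float(v_v @ h) / float(h @ x_c)
    return np.eye(3) + mu * np.outer(v_v, h) / float(v_v @ h)

def ransac_homology(edge_pts_above, edge_pts_below, v_v, v_l, v_r,
                    edge_map, iters=200, rng=np.random.default_rng(0)):
    """Hypothesise-and-score loop: sample one edge point above the horizon
    and one below, build H, and score by how many above-horizon edge points
    H maps onto edge pixels of the Canny map."""
    best_H, best_score = None, -1
    for _ in range(iters):
        x_c = edge_pts_above[rng.integers(len(edge_pts_above))]
        x_f = edge_pts_below[rng.integers(len(edge_pts_below))]
        H = floor_ceiling_homology(v_v, v_l, v_r, x_c, x_f)
        mapped = edge_pts_above @ H.T
        uv = (mapped[:, :2] / mapped[:, 2:]).round().astype(int)
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < edge_map.shape[1]) \
           & (uv[:, 1] >= 0) & (uv[:, 1] < edge_map.shape[0])
        score = int(edge_map[uv[ok, 1], uv[ok, 0]].sum())
        if score > best_score:
            best_H, best_score = H, score
    return best_H, best_score
```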

[Figure: a rectified image column x, marked with the floor point (x, y_x) and the corresponding ceiling point (x, y'_x).]


Page 13: ICCV 2011 Presentation


MODEL

\log P(M \mid X) = \overbrace{\sum_x \pi(x, y_x)}^{\text{likelihood}} - \overbrace{\sum_i \lambda(M, i)}^{\text{prior}}

\hat{M} = \arg\max_M P(M)\, P(X_{\text{mono}} \mid M)\, P(X_{\text{stereo}} \mid M)\, P(X_{\text{3D}} \mid M)   (36)

P(M \mid X) \propto P(X_{\text{mono}} \mid M)\, P(X_{\text{stereo}} \mid M)\, P(X_{\text{3D}} \mid M)\, P(M)   (37)

\log P(M \mid X) = \log P(X_{\text{mono}} \mid M) + \log P(X_{\text{stereo}} \mid M) + \log P(X_{\text{3D}} \mid M) + \log P(M)   (38)

P(M) = \frac{1}{Z} \lambda_1^{n_1} \lambda_2^{n_2} \lambda_3^{n_3}   (43)

\log P(\Phi \mid M) = \sum_p \log P(\phi_p \mid a^*_p) + c   (44)

\pi_{\text{mono}}(x, y_x) = \sum_{y'} \log P(\phi_i \mid a^*_i)   (45)

\log P(D \mid M) = \sum_x \Big( \sum_{i \in D_x} \log P(d_i \mid p_i, y_x) \Big)   (47)

\log P(X \mid M) = \sum_x \pi(x, y_x)   (49)

\log P(I_{1:K} \mid M) = \sum_{p \in I_0} \sum_{k=1}^{K} PC\big(p, \text{reproj}_k(p, M)\big)   (50)

[Graphical model: M generates X_mono, X_stereo, and X_3D.]
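To make the factorisation concrete, here is a small Python sketch that combines the three payoff matrices and scores a candidate model. The matrix shapes, the model encoding as floor rows plus corner types, and all numeric values are illustrative assumptions, not the paper's data structures.

```python
import numpy as np

def joint_log_posterior(pi_mono, pi_stereo, pi_3d, corner_log_prior, model):
    """log P(M | X) up to a constant, for a model given as its floor rows
    {y_x} and corner types, using the factorisation
    log P = log P(X_mono|M) + log P(X_stereo|M) + log P(X_3D|M) + log P(M).
    Each pi_* is an Ny x Nx payoff matrix."""
    pi_joint = pi_mono + pi_stereo + pi_3d            # payoffs simply add
    y_x = model["floor_rows"]                         # y_x for each column x
    data_term = sum(pi_joint[y, x] for x, y in enumerate(y_x))
    # corner_log_prior holds log(lambda_j) (negative), so adding it
    # penalises extra corners, matching P(M) ~ l1^n1 * l2^n2 * l3^n3.
    prior_term = sum(corner_log_prior[c] for c in model["corners"])
    return data_term + prior_term

# Hypothetical usage on a 4-column toy image.
Ny, Nx = 6, 4
rng = np.random.default_rng(1)
pis = [rng.normal(size=(Ny, Nx)) for _ in range(3)]
corner_log_prior = {"concave": np.log(0.6), "convex": np.log(0.3),
                    "occluding": np.log(0.1)}
model = {"floor_rows": [4, 4, 3, 3], "corners": ["concave"]}
print(joint_log_posterior(*pis, corner_log_prior, model))
```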


Page 14: ICCV 2011 Presentation


Prior

P(M) = \frac{1}{Z} \lambda_1^{n_1} \lambda_2^{n_2} \lambda_3^{n_3}   (43)

\lambda(M, i) = \begin{cases} \log \lambda_1, & \text{if } c_i \text{ is a concave corner} \\ \log \lambda_2, & \text{if } c_i \text{ is a convex corner} \\ \log \lambda_3, & \text{if } c_i \text{ is an occluding corner} \end{cases}   (36)

[Figure: the three wall-junction types, labelled concave, convex, and occluding.]
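A one-function sketch of the corner-count prior above; the lambda values here are placeholders for illustration, not the learned ones.

```python
import numpy as np

def log_model_prior(n_concave, n_convex, n_occluding, lam=(0.6, 0.3, 0.1)):
    """Unnormalised log of P(M) = (1/Z) * l1^n1 * l2^n2 * l3^n3 (eq. 43)."""
    l1, l2, l3 = lam
    return (n_concave * np.log(l1)
            + n_convex * np.log(l2)
            + n_occluding * np.log(l3))

# A model with two concave corners and one occluding corner:
print(log_model_prior(2, 0, 1))
```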


Page 15: ICCV 2011 Presentation


Likelihood For Photometric Features

and that of [8] in the multiple view setting. It will greatly simplify the remainder of this paper if we can assume that vertical lines in the world appear vertical in the image. To this end we apply the simple rectification procedure of [8].

We now describe our parametrization for indoor Manhattan models. Let the image dimensions be N_x × N_y. Following rectification, the vertical seams at which adjacent walls meet project to vertical lines, so each image column intersects exactly one wall segment. Let the top and bottom of the wall in column x be p_x = (x, y_x) and q_x = (x, y'_x) respectively (depicted in figure 3). Since each p_x lies on the floor plane and each q_x lies on the ceiling plane, we have

p_x = H q_x,   (1)

where H is a planar homology [5]. We show how to recover H in section 3.5. Once H is known, any indoor Manhattan model is fully described by the values {y_x}, leading to the simple parametrization

M = \{y_x\}_{x=1}^{N_x}.   (2)

We query this parametrization as follows. To check whether a pixel (x_0, y_0) lies on a vertical or horizontal surface we simply need to check whether y_0 is between y_{x_0} and y'_{x_0}. If we know the 3D position of the floor and ceiling planes then we can recover the depth of every pixel as follows. If the pixel lies on the floor or ceiling then we simply back-project a ray onto the corresponding plane. If not, we back-project onto the vertical plane defining the wall at that column (the depth of which we can recover from y_{x_0}). Note in particular that the orientation and depth of a pixel can be recovered from just the floor/wall intersection in its column; this will be important in later sections.

We now turn to the optimization framework that each subsequent section will feed into. Let {c_i} index the columns at which neighbouring walls meet in M. We define the payoff for M as

\Phi(M) = \sum_{x=1}^{N_x} \pi(x, y_x) - \sum_i \lambda(c_i)   (3)

where the payoff matrix \pi assigns payoffs for models with floor/wall intersections that pass through each pixel, and \lambda is a per-corner regulariser which penalizes complex models. Note that the value of \pi(x, y) is not restricted to dependence on pixel (x, y), nor even to a local region about that pixel; indeed, the payoff functions described in the following sections incorporate image evidence from widely separated image regions.

3.1. Monocular features

To infer indoor Manhattan models from monocular images we assume the graphical model shown in figure 4.

Figure 4. The graphical model relating building structures M to monocular image features \phi. p = (x, y) is a pixel location and a is the orientation predicted (deterministically) by M at p.

We turn first to the prior P(M). For a model with n_1 concave corners, n_2 convex corners, and n_3 occluding corners (c.f. figure 3), our prior on models is

P(M) = \frac{1}{Z} \lambda_1^{n_1} \lambda_2^{n_2} \lambda_3^{n_3}   (4)

which corresponds to a fixed probability for "events" corresponding to each type of corner and penalizes models for additional complexity. Z is a normalizing constant.

Our model includes hidden orientation variables a_i ∈ {1, 2, 3} for each pixel, with values corresponding to the three Manhattan orientations (shown as red, green, and blue regions in figure 1). As described in section 3, a is deterministic given the model M. We assume a linear likelihood for pixel features \phi,

P(\phi \mid a) = \frac{w_a^T \phi}{\sum_j w_a^T \phi_j}.   (5)

We now derive MAP inference. The posterior on M is

P(M \mid \phi) = \beta\, P(M) \prod_i P(\phi_i \mid a^*_i)   (6)

where a^*_i is the orientation deterministically predicted by model M at pixel p_i and \beta is a normalizing constant. We have omitted P(a_i \mid M) since it equals 1 for a^*_i and 0 otherwise. Taking logarithms,

\log P(M \mid \phi) = n_1 \lambda'_1 + n_2 \lambda'_2 + n_3 \lambda'_3 + \sum_i \log P(\phi_i \mid a^*_i) + k   (7)

where \lambda'_3 = \log \lambda_3 and similarly for the other penalties, and k corresponds to the normalizing denominators in (6) and (4), which we henceforth drop since it makes no difference to the optimization to come. We can now put (7) into payoff form (3) by writing

\pi_{\text{mono}}(x, y_x) = \sum_{y'} \log P(\phi_i \mid a^*_i), \qquad \lambda_{\text{mono}}(c) = -\lambda'_c   (8)

where i indexes pixel (x, y') and \lambda'_c is one of \lambda'_1, \lambda'_2, or \lambda'_3 according to the category of corner c. We show how to maximize payoffs of this form in section 4, which will allow us to solve MAP inference.


\log P(\Phi \mid M) = \sum_p \log P(\phi_p \mid a^*_p) + c   (44)

\pi_{\text{mono}}(x, y_x) = \sum_{y'} \log P(\phi_i \mid a^*_i)   (45)
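The highlighted payoff can be evaluated for every (x, y_x) with per-column prefix sums, as in this sketch. The per-pixel log-likelihood array, the orientation indexing, and the ceiling_row_of helper (standing in for the homology of the earlier slide) are assumptions for illustration.

```python
import numpy as np

def mono_payoffs(log_lik, ceiling_row_of):
    """pi_mono(x, y_x) = sum over pixels (x, y') in column x of
    log P(phi | a*), where the predicted orientation a* is 'ceiling' above
    y'_x, 'wall' between y'_x and y_x, and 'floor' below y_x.
    log_lik: (Ny, Nx, 3) per-pixel log-likelihoods for the three
    orientations (0=floor, 1=ceiling, 2=wall)."""
    Ny, Nx, _ = log_lik.shape
    # cum[y, x, a] = sum of log_lik[:y, x, a] (prefix sums down each column)
    cum = np.concatenate([np.zeros((1, Nx, 3)), np.cumsum(log_lik, axis=0)])
    pi = np.empty((Ny, Nx))
    for x in range(Nx):
        for y_x in range(Ny):
            y_c = int(np.clip(ceiling_row_of(x, y_x), 0, y_x))
            ceiling = cum[y_c, x, 1]                  # rows [0, y_c)
            wall = cum[y_x, x, 2] - cum[y_c, x, 2]    # rows [y_c, y_x)
            floor = cum[Ny, x, 0] - cum[y_x, x, 0]    # rows [y_x, Ny)
            pi[y_x, x] = ceiling + wall + floor
    return pi
```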


Page 16: ICCV 2011 Presentation


Likelihood For Photoconsistency Features

\log P(I_{1:K} \mid M) = \sum_{p \in I_0} \sum_{k=1}^{K} \overbrace{PC}^{\text{photo-consistency measure}}\big(p, \overbrace{\text{reproj}_k(p, M)}^{\text{reprojection of } p \text{ into frame } k}\big)

Figure 5. Pixel correspondences across multiple views are computed by back-projection onto the model M followed by re-projection into auxiliary views.

3.2. Multiple view features

We now formulate the payoff function \pi_{\text{stereo}} for the case that multiple views of the scene are available. We assume one base frame I_0 and M auxiliary frames I_1, ..., I_M. We assume that poses are given for each camera, as output for example by a structure-from-motion system, and that cameras are calibrated. We normalize image intensities to zero mean and unit variance.

Intuitively, we treat inference in this setting as follows. We consider models M in terms of their projection into I_0. We explained in section 3 that models parametrized in image coordinates specify unique 3D models. Any hypothesized model can therefore be re-projected into auxiliary frames, giving pixel-wise correspondences between frames as shown in figure 5. From this we compute a photo-consistency measure PC(·), which provides the likelihood P({I_k} | M). The prior remains as in (4).

Optimizing over photo-consistency has been standard in the stereo literature for several decades [16]; our contribution is to show (i) that in the particular case of indoor Manhattan models, photo-consistency can be expressed as a payoff matrix; (ii) that we can therefore perform efficient and exact global optimization; and (iii) that this fits naturally within a Bayesian framework alongside monocular and 3D features.

Our approach could also be cast as solving the general stereo problem where, in place of priors based on various pixel-wise norms, our prior assigns zero probability to all non-indoor-Manhattan reconstructions.

Let reproj_k(p; M) be the re-projection of pixel p from the base frame I_0 into auxiliary frame I_k via model M. Then

\log P(\{I_k\} \mid M) = \sum_{p \in I_0} \sum_{k=1}^{M} PC(p, \text{reproj}_k(p, M)),   (9)

where in our experiments PC(p, q) is the sum of squared differences between pixels p and q.

We explained in section 3 that the depth of each pixel can be recovered from the location of the floor/wall intersection y_x in column x. Hence we can replace reproj_k(p; M) with reproj_k(p; y_x) and write

\pi_{\text{stereo}}(x, y_x) = \sum_{y=1}^{N_y} \sum_{k=1}^{M} PC(p, \text{reproj}_k(p, y_x)),   (10)

where p = (x, y). To see this, substitute (10) into (3) and observe that the result is precisely (9).

Note that the column-wise decomposition (10) neither commits us to optimizing over columns independently, nor to ignoring interactions between columns. Such interactions come into effect when we optimize over the full payoff matrix in section 4, and our results will show that widely separated image regions often interact strongly. The derivations in this section follow deductively from the indoor Manhattan assumption; the only approximation is the following.

Occlusions. We have ignored self-occlusions in (9). For short baselines (such as frames sampled over a few seconds from a moving camera), this is unproblematic since indoor environments tend to be mostly convex from any single point of view. Even in highly non-convex environments our system achieves excellent results by integrating 3D and monocular features, and enforcing strong global consistency, as will be shown in section 6.

3.3. 3D features

In this section we explore the context in which a 3D point cloud is available during inference. The point clouds generated by structure-from-motion systems are typically too sparse for direct reconstruction, but can provide useful cues alongside monocular and stereo data.

Our graphical model for 3D data is depicted in figure 6. The model M is sampled according to the prior (4), then depth measurements d_i are generated for pixels p_i. Many such measurements will correspond to clutter or measurement errors, rather than to the walls represented by M. Our model captures this uncertainty explicitly through the latent variable t_i, which has the following interpretation. If t_i = ON then d_i corresponds to some surface represented explicitly in M. Otherwise, either t_i = IN, meaning some clutter object within the room was measured, or t_i = OUT, in which case an object outside the room was measured, such as through a window.

Figure 6. The graphical model relating indoor Manhattan models to 3D points. The hidden variable t indicates whether the point is inside, outside, or coincident with the model.


Equivalent to canonical stereo formulation subject to indoor Manhattan assumption.

Page 17: ICCV 2011 Presentation

Likelihood For Point Cloud Features

Figure 7. Depth measurements d_i might be generated by a surface in our model (represented by t_i = ON) or by an object inside or outside the environment (in which case t_i = IN or OUT respectively).

The likelihoods we use are

P(d \mid p, M, \mathrm{IN}) = \begin{cases} \alpha, & \text{if } 0 < d < r(p; M) \\ 0, & \text{otherwise} \end{cases}    (11)

P(d \mid p, M, \mathrm{OUT}) = \begin{cases} \beta, & \text{if } r(p; M) < d < N_d \\ 0, & \text{otherwise} \end{cases}    (12)

P(d \mid p, M, \mathrm{ON}) = \mathcal{N}\big(d \, ; \, r(p; M), \sigma^2\big) .    (13)

where α and β are determined by the requirement that the probabilities sum to 1, and r(p; M) denotes the depth predicted by M at p. We compute likelihoods on d by marginalizing,

P(d \mid p, M) = \sum_{t} P(d \mid p, M, t) \, P(t) ,    (14)

where the prior P(t) is a look–up table with three entries, one for each of IN, OUT, and ON.
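For concreteness, a minimal sketch of the marginal likelihood (14) as a three-component mixture follows. The uniform densities for the IN and OUT components follow from the normalization requirement above, but the concrete parameter names (N_d, sigma, and the lam_* priors) are our own and not taken from the paper.

import math

def depth_log_likelihood(d, r, N_d, sigma, lam_in, lam_out, lam_on):
    # d      : measured depth at pixel p
    # r      : depth r(p; M) predicted by the model at p
    # N_d    : maximum representable depth (support of the OUT component)
    # sigma  : standard deviation of the ON (on-surface) Gaussian (assumed parameterization)
    # lam_*  : the three entries of the prior P(t), for t in {IN, OUT, ON}
    p_in = (1.0 / r) if 0.0 < d < r else 0.0                  # uniform on (0, r), Eq. (11)
    p_out = (1.0 / (N_d - r)) if r < d < N_d else 0.0         # uniform on (r, N_d), Eq. (12)
    p_on = math.exp(-0.5 * ((d - r) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    p = lam_in * p_in + lam_out * p_out + lam_on * p_on       # marginalization, Eq. (14)
    return math.log(max(p, 1e-300))                           # guard against log(0)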

As explained in section 3, computing the depth of the model at pixel p requires knowledge only of the floor/wall intersection y_x in column x, so we substitute

P(d \mid p, y_x) = P(d \mid p, M) .    (15)

Let D denote all depth measurements, P denote all pixels, and let D_x contain the indices of all depth measurements in column x. Then

P(M \mid D, P) = P(M) \prod_{x} \prod_{i \in D_x} P(d_i \mid p_i, y_x)    (16)

\log P(M \mid D, P) = \log P(M) + \sum_{x} \Big( \sum_{i \in D_x} \log P(d_i \mid p_i, y_x) \Big) ,    (17)

which we write in payoff form as

\pi_{3D}(x, y_x) = \sum_{i \in D_x} \log P(d_i \mid p_i, y_x)    (18)

and the penalty function λ remains as in (8).
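Accumulating these per-measurement log-likelihoods into the payoff (18) is then a short loop. The sketch below reuses depth_log_likelihood from the sketch above; predicted_depth is a hypothetical helper returning r(p_i; y_x) for a measurement's pixel under a candidate floor/wall intersection.

import numpy as np

def payoff_3d(measurements, predicted_depth, W, H, **mixture):
    # measurements    : list of (x, y, d) with pixel p_i = (x, y) and measured depth d_i
    # predicted_depth : hypothetical helper; predicted_depth(x, y, y_x) returns r(p_i; y_x)
    # mixture         : keyword arguments forwarded to depth_log_likelihood (sketch above)
    payoff = np.zeros((W, H))
    for x, y, d in measurements:
        for y_x in range(H):
            r = predicted_depth(x, y, y_x)
            payoff[x, y_x] += depth_log_likelihood(d, r, **mixture)   # Eq. (18)
    return payoff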

3.4. Combining features

We combine photometric, stereo, and 3D data into a joint model by assuming conditional independence given M,

P(M \mid X_{\mathrm{mono}}, X_{\mathrm{stereo}}, X_{3D}) = P(M) \, P(X_{\mathrm{mono}} \mid M) \, P(X_{\mathrm{stereo}} \mid M) \, P(X_{3D} \mid M)    (19)

Taking logarithms leads to summation over payoffs,

\pi_{\mathrm{joint}}(x) = \pi_{\mathrm{mono}}(x) + \pi_{\mathrm{stereo}}(x) + \pi_{3D}(x) .    (20)
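Under this independence assumption the fusion step itself is just an element-wise sum of the per-modality payoff matrices, assuming each is stored as a W x H numpy array indexed by (x, y_x):

import numpy as np

def combine_payoffs(pi_mono, pi_stereo, pi_3d):
    # Element-wise sum over (x, y_x), mirroring Eq. (20).
    return np.asarray(pi_mono) + np.asarray(pi_stereo) + np.asarray(pi_3d)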

3.5. Resolving the floor and ceiling planes

We resolve the equation of the floor and ceiling planes as follows. If C is the camera matrix for any frame and v_v is the vertical vanishing point in that frame, then n = C^{-1} v_v is normal to the floor and ceiling planes. We sweep a plane with this orientation through the scene, recording at each step the number of points within a distance τ of the plane (τ = 0.1% of the diameter of the point cloud in our experiments). We take as the floor and ceiling planes the minimum and maximum locations such that the plane contains at least 5 points. We found that this simple heuristic worked without failure on our training set.
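One possible implementation of this plane sweep, assuming the point cloud is an (N, 3) numpy array and n has already been obtained from the vertical vanishing point (the step count and return convention are our own choices):

import numpy as np

def sweep_floor_ceiling(points, n, tol_frac=0.001, min_points=5, n_steps=200):
    # points : (N, 3) array of reconstructed 3D points
    # n      : plane normal, e.g. from n = C^{-1} v_v (need not be unit length)
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    offsets = points @ n                                     # signed offset of each point along n
    diameter = np.linalg.norm(points.max(axis=0) - points.min(axis=0))
    tol = tol_frac * diameter                                # 0.1% of the point-cloud diameter
    hits = [d for d in np.linspace(offsets.min(), offsets.max(), n_steps)
            if np.count_nonzero(np.abs(offsets - d) < tol) >= min_points]
    if not hits:
        return None, None
    return min(hits), max(hits)                              # floor and ceiling offsets along n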

Let the two non–vertical vanishing points be v_l and v_r, and let h = v_l × v_r. Select any two corresponding points x_f and x_c on the floor and ceiling planes respectively. Then the Manhattan homology defined in (1) is given by

H = I + \mu \, \frac{v_v h^{T}}{v_v \cdot h} ,    (21)

where \mu = \langle v_v, x_c, x_f, (x_c \times x_f) \times h \rangle is the characteristic cross ratio of H.
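Given v_v, h, and the cross ratio mu, the homology (21) is a rank-one update of the identity; a minimal numpy sketch:

import numpy as np

def manhattan_homology(v_v, h, mu):
    # v_v : vertical vanishing point (homogeneous 3-vector)
    # h   : horizon line h = v_l x v_r (homogeneous 3-vector)
    # mu  : characteristic cross ratio of the homology, Eq. (21)
    v_v = np.asarray(v_v, dtype=float)
    h = np.asarray(h, dtype=float)
    return np.eye(3) + mu * np.outer(v_v, h) / np.dot(v_v, h)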

4. Inference

We have reduced MAP inference to optimization over a payoff matrix:

\hat{M} = \arg\max_M \sum_{x} \pi(x, y_x) - \sum_{i} \lambda(c_i)    (22)

In previous work [7] we showed that if an indoor Manhattan model M is optimal over image columns [1, x], then the "cropped" model M', obtained by restricting M to the sub–interval [1, x'], x' < x, must itself be optimal over that sub–interval. This permits a dynamic programming solution in which M is built up from left to right.

Our algorithm differs from that of [7] in the following respects. First, we optimize over general payoff matrices of the form (3), whereas neither π_stereo nor π_3D decomposes as assumed in [7]. Second, we do not include the number of corners as a state variable, but instead accumulate penalties directly into the objective function, which reduces complexity by O(K), where K is the number of walls in the model. For completeness we give revised recurrence relations in an appendix.



Page 19: ICCV 2011 Presentation


Combining Features


No approximations other than conditional independence and occlusions

\hat{M} = \arg\max_M P(M) \, P(X_{\mathrm{mono}} \mid M) \, P(X_{\mathrm{stereo}} \mid M) \, P(X_{3D} \mid M)    (36)

P(M \mid X) = P(X_{\mathrm{mono}} \mid M) \, P(X_{\mathrm{stereo}} \mid M) \, P(X_{3D} \mid M) \, P(M)    (37)

\log P(M \mid X) = \log P(X_{\mathrm{mono}} \mid M) + \log P(X_{\mathrm{stereo}} \mid M) + \log P(X_{3D} \mid M) + \log P(M)    (38)

Page 20: ICCV 2011 Presentation

MAP inference

INFERENCE


2. Background

The Manhattan world assumption was introduced by Coughlan and Yuille [?] over a decade ago and has seen increasing attention in the computer vision literature in recent years [?, ?, ?, ?, ?]. Furukawa et al. [?] proposed a Manhattan–world stereo algorithm based on graph cuts. While their approach is concerned with dense photo–realistic reconstructions, ours is intended to capture semantic properties of the scene using a concise representation. The output of their approach, a polygonal mesh, has no immediate semantic interpretation, whereas our models, though less detailed, come packaged with a direct interpretation. A by–product is efficiency: we count computation time in hundreds of milliseconds, whereas Furukawa et al. report waiting more than an hour.

Another approach to interpreting Manhattan worlds is to model scenes as a union of cuboids. This approach has a long history beginning with Roberts' 1965 thesis [?], and has recently been revisited using modern probabilistic techniques [?, ?].

Lee et al. [?] first proposed indoor Manhattan models (a sub–class of general Manhattan models) for monocular reconstructions. They used a branch–and–bound algorithm together with a line–sweep heuristic for approximate inference. Flint et al. [?] employed a similar model but showed a dynamic programming algorithm that performed exact inference in polynomial time. In earlier work [?] Flint et al. also demonstrated Manhattan reconstructions integrated with a SLAM system, but this work inferred models from single frames and then extrapolated these forward in time. In contrast, our work incorporates both multiple view geometry and 3D points directly into a joint inference procedure. We also learn parameters in a Bayesian framework, whereas neither Lee nor Flint utilized training data in any form.

Felzenszwalb and Veksler [?] posed the reconstruction problem in terms of energy minimization, which they showed could be solved using dynamic programming, while Barinova et al. [?] modeled outdoor scenes using a CRF. However, these approaches do not permit strong geometric constraints and so cannot be extended to multiple views.

Semantic scene understanding has, broadly speaking, seen less attention within the multiple view community. The CamVid [?] database of outdoor videos with semantic segmentations is an important and encouraging exception. Brostow et al. [?] showed that simple structure–from–motion cues lead to pleasing segmentations. Sturgess et al. [?] extended this approach to a CRF framework. We compare our method with this approach in section 6.

3. Proposed Model

In this section we describe the indoor Manhattan model. We consider three sensor modalities: monocular image features, stereo features, and 3D point clouds. For each we present a generative model relating observed features to the Manhattan scene structure, which we denote M. For each sensor modality we show that MAP inference can be reduced to maximization over a payoff function π(x, y). This allows us to present a unified dynamic programming solution in section 4, which efficiently solves MAP inference for all three sensor modalities.

General Manhattan environments have structural surfaces oriented in three cardinal orientations. Indoor Manhattan environments are a special case that consist of a floor plane, a parallel ceiling plane, and a set of vertical walls extending between them. Each wall extends all the way from the floor to the ceiling, and walls meet at vertical edges. We always consider environments observed from a camera located between the floor and ceiling. Since each wall extends from floor to ceiling, indoor Manhattan environments always project as a linear chain of walls in the image, as shown in figure ??. Further, the edges at which adjacent walls meet can be categorized as concave, convex, or occluding, as illustrated in figure ?? and discussed further in [?].

We assume that vanishing points for the three Manhattan directions are given. We use the vanishing point detector described by Zhang et al. [?] in the monocular setting and that of [?] in the multiple view setting. It will greatly simplify the remainder of this paper if we can assume that vertical lines in the world appear vertical in the image. To this end we apply the simple rectification procedure of [?].

We now describe our parametrization for indoor Manhattan models. Let the image dimensions be N_x × N_y. Following rectification, the vertical seams at which adjacent walls meet project to vertical lines, so each image column intersects exactly one wall segment. Let the bottom and top of the wall in column x be p_x = (x, y_x) and q_x = (x, y'_x) respectively (depicted in figure ??). Since each p_x lies on the floor plane and each q_x lies on the ceiling plane, we have

p_x = H q_x ,    (1)

where H is a planar homology [?]. We show how to recover H in section 3.5. Once H is known, any indoor Manhattan model is fully described by the values {y_x}, leading to the simple parametrization,

M = \{y_x\}_{x=1}^{N_x} .    (2)

We query this parametrization as follows. To check whether a pixel (x_0, y_0) lies on a vertical or horizontal surface we simply need to check whether y_0 is between y_{x_0} and y'_{x_0}. If we know the 3D position of the floor and ceiling planes then we can recover the depth of every pixel as follows. If the pixel lies on the floor or ceiling then we simply back–project a ray onto the corresponding plane. If not, the pixel lies on a wall, whose depth in that column is fixed by back–projecting the floor point p_x onto the floor plane.

Reduced to optimisation over payoff matrix:

\hat{M} = \arg\max_M \sum_{x} \pi(x, y_x) - \sum_{i=0}^{k} \lambda(M, i)    (35)

\hat{M} = \arg\max_M P(M \mid X)    (36)

\hat{M} = \arg\max_M P(M) \, P(X_{\mathrm{mono}} \mid M) \, P(X_{\mathrm{stereo}} \mid M) \, P(X_{3D} \mid M)    (37)

Page 21: ICCV 2011 Presentation

Recursive Sub-problem Formulation
What is the optimal model up to column x?

We compare our approach with two alternative systems, though neither comparison is ideal.

Our first comparison is with the approach of Brostow et al. [3], who performed semantic segmentation by training a per–pixel classifier on structure–from–motion cues. Our implementation of their system uses exactly the features they describe, with classes corresponding to the three Manhattan orientations. While they trained a randomized forest, we trained a multi–class SVM because a reliable SVM library was more readily available to us. Given the margin between our results it is unlikely that a different classifier would significantly change the outcome.

The second comparison is with the monocular approach of Lee et al. [14]. One would of course expect a multiple view approach to outperform a monocular approach, but as one of the very few previous approaches to have explicitly leveraged the indoor Manhattan assumption, we feel this comparison is important to demonstrate the benefit of a Bayesian framework and the integration of stereo and 3D cues.

The performance of each system is shown in figure 9. Our system significantly out–performs both others. Even when restricted to monocular features, our system outperforms [3], which has access to 3D cues. This reflects the utility of global consistency and the indoor Manhattan representation in our approach.

The initialization procedure of [14] fails for 31% of our training images, so at the bottom of figure 9 we show results for their system after excluding these images. Labeling accuracy increases to within 3% of our monocular–only results, though on the depth error metric a margin of 10% remains. This illustrates the effect of our training procedure, which optimizes for the depth error.

Figure 9 also shows that joint estimation is superior to using any one sensor modality alone. Anecdotally, we find that using 3D cues alone often fails within large textureless regions in which the structure–from–motion system failed to track any points, whereas stereo or monocular cues alone often perform better in such regions but can lack precision at corners and boundaries.

Figure 11 shows timing results for our system. For each triplet of frames, our system requires on average less than one second to compute features for all three frames and less than 100 milliseconds to perform optimization.

7. Conclusion

We have presented a Bayesian framework for scene understanding in the context of a moving camera. Our approach draws on the indoor Manhattan assumption introduced for monocular reasoning, and we have shown that techniques from monocular and stereo vision can be integrated with 3D data in a coherent Bayesian framework.

1 This row excludes cases for which [14] was unable to find overlapping lines during initialization.

Algorithm               Mean depth error (%)   Labeling accuracy (%)
Our approach (full)     14.5                   75.5
Stereo only             17.4                   69.5
3D only                 15.2                   71.1
Monocular only          24.8                   69.2
Brostow et al. [3]                             40.6
Lee et al. [14]         79.8                   45.5
  excluding failures 1  34.1                   66.2

Figure 9. Performance on our data–set. Labeling accuracy is the percentage of correctly labeled pixels over the data–set, and depth error is a per–pixel average of (23).

In future work we intend to use indoor Manhattan models to reason about objects, actions, and scene categories. We also intend to investigate structural SVMs for learning parameters, which may allow us to relax the conditional independence assumptions between sensor modalities.

8. Appendix

Recurrence relations for MAP inference. Let f_out(x, y, a), 1 ≤ x ≤ N_x, 1 ≤ y ≤ N_y, a ∈ {1, 2}, be the maximum payoff for any indoor Manhattan model M spanning columns [1, x], such that (i) M contains a floor/wall intersection at (x, y), and (ii) the wall that intersects column x has orientation a. Then f_out can be computed by recursive evaluation of the recurrence relations,

f_{\mathrm{out}}(x, y, a) = \max_{a' \in \{1, 2\}} \begin{cases} f_{\mathrm{up}}(x, y-1, a') - \lambda(x) \\ f_{\mathrm{down}}(x, y+1, a') - \lambda(x) \\ f_{\mathrm{in}}(x, y, a') - \lambda(x) \end{cases}    (25)

f_{\mathrm{up}}(x, y, a) = \max\big( f_{\mathrm{in}}(\cdot), \, f_{\mathrm{up}}(x, y-1, a) \big) ,    (26)

f_{\mathrm{down}}(x, y, a) = \max\big( f_{\mathrm{in}}(\cdot), \, f_{\mathrm{down}}(x, y+1, a) \big) ,    (27)

f_{\mathrm{in}}(x, y, a) = \max_{x' < x} \big( f_{\mathrm{out}}(x', y', a) + \Delta \big) ,    (28)

\Delta = \sum_{i=x'}^{x} \pi(i, y') .    (29)

Here we have treated f_in, f_up, and f_down simply as notational placeholders; for their interpretations in terms of sub–problems see [7]. Finally, the base cases are

f_{\mathrm{out}}(0, y, a) = 0 \quad \forall y, a    (30)

f_{\mathrm{up}}(x, 0, a) = -\infty \quad \forall x, a    (31)

f_{\mathrm{down}}(x, N_y, a) = -\infty \quad \forall x, a .    (32)
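The recurrences above track the wall orientation a and let the floor row follow a vanishing line within each wall. As a rough illustration of the left-to-right dynamic programming structure only, here is a deliberately simplified version in which every wall is treated as projecting to a constant floor row and a single constant corner penalty is paid whenever that row changes. It is not the authors' algorithm, but it runs in the same O(WH) time noted below.

import numpy as np

def map_infer_simplified(payoff, corner_penalty):
    # payoff         : (W, H) array with payoff[x, y] = pi(x, y_x = y)
    # corner_penalty : constant cost for introducing a corner between adjacent columns
    W, H = payoff.shape
    best = payoff[0].copy()                    # best score over column 0, ending at row y
    back = np.zeros((W, H), dtype=int)         # back-pointer to the previous column's row
    for x in range(1, W):
        stay = best                            # continue the current wall (same row, no penalty)
        switch = best.max() - corner_penalty   # end the wall and restart at any row
        back[x] = np.where(stay >= switch, np.arange(H), int(best.argmax()))
        best = np.maximum(stay, switch) + payoff[x]
    rows = np.zeros(W, dtype=int)              # backtrack the optimal floor row per column
    rows[-1] = int(best.argmax())
    for x in range(W - 1, 0, -1):
        rows[x - 1] = back[x, rows[x]]
    return rows, float(best.max())

With payoff set to the joint payoff of (20) and corner_penalty derived from the log gamma terms, this returns one floor/wall intersection per column, which is exactly the parametrization (2).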


Recurrence relations / Boundary conditions

Complexity: O(WH). Flint, Mei, Murray, and Reid, "A Dynamic Programming Approach to Reconstructing Building Interiors", ECCV 2010.

Page 22: ICCV 2011 Presentation

RESULTS

Input
• 3 frames sampled at 1 second intervals

• Camera poses from SLAM

• Point cloud (approx. 100 points)

Dataset
• 204 triplets from 10 video sequences

• Image dimensions 640 x 480

• Manually annotated ground truth

Page 23: ICCV 2011 Presentation

RESULTS

Page 24: ICCV 2011 Presentation

RESULTS

[1] Flint, Mei, Murray, and Reid, “A Dynamic Programming Approach to Reconstructing Building Interiors”, ECCV 2010

[2] Brostow, Shotton, Fauqueur, and Cipolla, “Segmentation and recognition using structure from motion point clouds”, ECCV 2008

[3] Lee, Hebert, and Kanade, “Geometric reasoning for single image structure recovery”, CVPR 2009

Algorithm             Mean depth error (%)   Labeling error (%)
Our approach (full)   14.5                   24.5
Stereo only           17.4                   30.5
3D only [1]           15.2                   28.9
Monocular only        24.8                   30.8
Brostow et al. [2]                           39.4
Lee et al. [3]        79.8                   54.5

Page 25: ICCV 2011 Presentation

RESULTS

Monocular features: 160 ms
Stereo features: 730 ms
3D features: 9 ms
Inference: 102 ms

997 ms mean processing time per instance

Page 26: ICCV 2011 Presentation

RESULTS

Sparse texture Non-Manhattan

Page 27: ICCV 2011 Presentation

RESULTS

Poor Lighting Conditions

Page 28: ICCV 2011 Presentation

RESULTS

Clutter

Page 29: ICCV 2011 Presentation

RESULTS
Failure Cases

Page 30: ICCV 2011 Presentation

Failure Cases

RESULTS

Page 31: ICCV 2011 Presentation

• We wish to leverage multiple-view geometry for scene understanding.

• Indoor Manhattan models are a simple and meaningful model family.

• We have presented a probabilistic model for monocular, stereo, and point cloud features.

• A fast and exact inference algorithm exists.

• Results show state-of-the-art performance.

SUMMARY
