SceneGraphNet: Neural Message Passing for 3D Indoor Scene ...

transcript

SceneGraphNet: Neural Message Passing for 3D Indoor Scene Augmentation

Yang Zhou Zachary While Evangelos Kalogerakis

University of Massachusetts Amherst

{yangzhou,zwhile,kalo}@cs.umass.edu

Abstract

In this paper we propose a neural message passing ap-

proach to augment an input 3D indoor scene with new ob-

jects matching their surroundings. Given an input, poten-

tially incomplete, 3D scene and a query location (Figure

1), our method predicts a probability distribution over ob-

ject types that fit well in that location. Our distribution

is predicted though passing learned messages in a dense

graph whose nodes represent objects in the input scene and

edges represent spatial and structural relationships. By

weighting messages through an attention mechanism, our

method learns to focus on the most relevant surrounding

scene context to predict new scene objects. We found that

our method significantly outperforms state-of-the-art ap-

proaches in terms of correctly predicting objects missing in

a scene based on our experiments in the SUNCG dataset.

We also demonstrate other applications of our method, in-

cluding context-based 3D object recognition and iterative

scene generation.

1. Introduction

With the increasing number of 3D models and scenes

becoming available in online repositories, the need for ef-

fectively answering object queries in 3D scenes has become

greater than ever. A common type of scene queries is to

predict plausible object types that match well the surround-

ing context of an input 3D scene. For example, as shown

in Figure 1, given a query location close to a TV stand and

a room corner, a likely choice for an object to be added in

that location can be a speaker, or less likely a plant.

We designed a neural message passing method to pre-

dict a probability distribution over object types given query

locations in the scene. Our predicted distributions can be

used in various vision and graphics tasks. First, our method

can enhance 3D object recognition in scenes by taking into

account the scene context (Figure 2). Second, it can also be

used to automatically populate 3D scenes with more objects

by evaluating our probability distribution at different loca-

tions in the scene (Figure 3). Another related application is

to provide object type recommendations to designers while

interactively modeling a 3D scenes.

What object type can go in that location?

SceneGraphNet Predictions

SpeakerLamp

Shelves

PlantOttoman

Query location

relations

Figure 1. SceneGraphNet captures relationships between objects

in an input 3D scene through iterative message passing in a dense

graph to make object type predictions at query locations.

Our method models the scene as a graph, where nodes

represent existing objects and edges represent various spa-

tial and structural relationships between them, such as sup-

porting, surrounding, adjacency, and object co-occurrence

relationships. The edges are not limited only to neighbor-

ing objects, but can also capture long-range dependencies

in the scene e.g., the choice of a sofa in one side of a room

can influence the selection of other sofas, chairs, or tables in

the opposite side of the room to maintain a plausible object

set and arrangement.

Our method is inspired by graph neural network ap-

proaches that learn message passing in graphs [1, 6, 7] to

infer node representations and interactions between them.

Our method learns message passing to aggregate the sur-

rounding scene context from different objects. It addresses

a number of challenges in this setting. First, scene objects

may have multiple types of relationships between them e.g.,

a nightstand can be adjacent to a bed, while at the same

time it is placed symmetrically wrt to another nightstand

surrounding the bed from both sides. We found that object

relationships are more effectively captured through neural

network modules specialized for each type of relationship.

In addition, we found that predictions are more plausible,

when we model not only local or strictly hierarchical object

relationships, but also long-range relationships captured in a

dense graph. Since we do not know a priori which relation-

ships are most important for predicting objects at query lo-

cations, we designed an attention mechanism to weigh dif-

ferent messages i.e., we find which edges are more impor-

tant for making object predictions. Finally, we found that

aggregating messages from multiple objects are better han-

dled with memory-enabled units (GRUs) rather than other

simpler schemes, such as summation or max-pooling.

3D Shape Recognition through MVCNN

Ottoman

Pillow

3D Shape Recognition through MVCNN & SceneGraphNet

1%1%1%1%3%

…Ottoman

ToyPillow

Ottoman

Figure 2. Context-based object recognition. Left: Object recogni-

tion using a multi-view CNN [19] without considering the scene

context. Right: Improved recognition by fusing the multi-view

CNN and SceneGraphNet predictions based on scene context.

Incomplete scene Full scene

Figure 3. Iterative scene synthesis. Given an incomplete scene,

our method is used to populate it progressively with more objects

at their most likely locations predicted from SceneGraphNet.

We tested our method against several alternatives in a

large dataset based on SUNCG. We evaluated how accu-

rately different methods predict object types that are in-

tentionally left out from SUNCG scenes. Our method im-

proves prediction of missing object types by a large mar-

gin of 16% (51% → 67%) compared to the previous best

method adopted for this task. Our contribution is two-fold:

• a new graph neural network architecture to model

short- and long-range relationships between objects in

3D indoor scenes.

• an iterative message passing scheme, reinforced with

an attention mechanism, to perform object-related pre-

diction tasks in scenes, including spatial query an-

swering, context-based object recognition, and itera-

tive scene synthesis.

2. Related Work

Our work is related to learning methods for synthesizing

indoor 3D scenes by predicting new objects to place in them

either unconditionally or conditioned on some input, such

as spatial queries. Our work is also related to graph neural

networks and neural message passing.

Indoor scene synthesis. Early work in 3D scene model-

ing employed hand-engineered kernels and graph walks to

retrieve objects from a database of 3D models that are com-

patible with the surrounding context of an input scene [3, 5].

Alternatively, Bayesian Networks have been proposed to

synthesize arrangements of 3D objects by modeling object

co-occurence and placement statistics [4]. Other probabilis-

tic graphical models have also been used to model pairwise

compatibilities between objects to synthesize scenes given

input sketches of scenes [23] or RGBD data [2, 10]. These

earlier methods were mostly limited to small scenes. Their

generalization ability was limited due to the coarse scene

relationships, shape representations, and statistics captured

in their shallow or hand-engineered statistical models.

With the availability of large scene datasets such as

SUNCG [18], more sophisticated learning methods have

been proposed. Henderson et al. [9] proposed a Dirichlet

process mixture model to model higher-order relationships

of objects in terms of co-occurences and relative displace-

ments. More related to our approach are deep networks pro-

posed for scene synthesis and augmentation. Wang et al.

[21, 15] use image-based CNNs to encode top-down views

of input scenes, then decode them towards object category

and location predictions. In a concurrent work, Zhang et al.

[24] use a Variational Auto-Encoder coupled with a Gener-

ative Adversarial Network to generate scenes represented in

a matrix where each column represents an object with loca-

tion and geometry attributes. Most relevant to our work is

GRAINS [12], a recursive auto-encoder network that gen-

erates scenes represented as tree-structured scene graphs,

where internal nodes represent groupings of objects accord-

ing to various relationships (e.g., surrounding, supporting),

and leaves represent object categories and sizes. GRAINS

directly encodes only local dependencies between objects of

the same group in its tree-structured architecture. The tree

representing the scene is created through hand-engineered

heuristic grouping operations. Concurrently to our work,

Wang et al. [20] proposed a graph neural network for scene

synthesis. The edges in the graph represent spatial and se-

mantic relationships of objects, however, they are pruned

through heuristics. Our method instead models scenes as

dense graphs capturing both short- and long-range depen-

dencies between objects. Instead of hand-engineering pri-

orities for object relationships, our methods learns to attend

the most relevant relationships to augment a scene.

Graph Neural Networks. A significant number of meth-

ods has been proposed to model graphs as neural networks

[16, 8, 7, 13, 6, 1, 17]. Our method is mostly related to

approaches that perform message passing along edges to

update node representations through neural network oper-

ations [1, 6, 7]. In our graph network, nodes are connected

with multiple edges allowing the exchange of information

over multiple structural relationships across objects, and we

also use an attention mechanism that weigh the most rele-

vant messages for scene object predictions. We also adapt

the graph structure, node representations, message ordering

and aggregation in the particular setting of 3D indoor scene

modeling.

3. Method

The input to our method is a set of 3D models of objects

arranged in a scene s. We assume that the current objects in

Figure 4. An example of the graph structure used for neural message passing in SceneGraphNet for a bedroom scene. Left: an input 3D

scene. Middle: graph structure and object relationships we modeled (some relationships, e.g. dense “co-occurrence” and “next-to” ones,

are skipped for clarity). Right: messages received by the object “desk” from other nodes in the graph for all different types of relationships.

the scene are labeled based on their types (e.g., sofa, table,

and so on). Given a query location p in the scene (Figure 1),

the output of our method is a probability distribution over

different object types, or categories, P (C|p, s) expressing

how likely is for objects from each of these categories to

fit well in this location and match the scene context. The

probability distribution can be used in various ways, such

as simply selecting an object from a category with the high-

est probability to be placed in the query location for scene

augmentation tasks, or presenting a list of object type rec-

ommendations to a designer ordered by their probability for

interactive scene modeling tasks. Alternatively, for object

recognition tasks our distribution can be combined with a

posterior that attempts to predict the object category based

only on individual shape data.

To determine the target probability distribution, our

method first creates a graph whose nodes represent objects

in the scene and edges represent different types of rela-

tionships between objects (Figure 4). Information flows in

this graph by iteratively passing learned messages between

nodes connected through edges. We note that the graph we

use for message passing should not be confused with scene

graphs (often in the form of trees) used in graphics file for-

mats for representing objects in scenes based on hierarchi-

cal transformations. Our scene graph representation has a

much richer structure and is not limited to a tree.

In the following sections, we explain message passing

(Section 3.1), different strategies for designing the graph

structure (Section 3.2), the target distribution prediction

(Section 3.3), and applications (Section 3.4).

3.1. Message passing

Figure 5 illustrates our message passing and underlying

neural architecture. Each node i in our scene graph rep-

resents a 3D object. The node internally carries a vecto-

rial representation hi encoding information about both the

shape and also its scene context based on the messages it

receives from other nodes. The messages are used to update

the node representations so that these reflect the scene con-

text captured in these messages. New messages are emit-

ted from nodes, which result in more information exchange

about scene context. In this manner, the message passing

procedure runs iteratively. In the following paragraphs, we

describe the steps of this message passing procedure.

Initialization. Each node representation is initialized

based on the shape representation xi at the node:

h(0)i = finit(xi;winit) (1)

where finit is a two-layer MLP with learnable parame-

ters winit outputting a 100-dimensional node representa-

tion (details are provided in the supplementary material). In

our implementation, the shape representation is formed by

the concatenation of three vectors xi = [ci,pi,di] where

ci is a one-hot vector representing its category, pi ∈ R3 is

its centroid 3D position in the scene, di ∈ R3 is its scale

(its oriented bounding box lengths).

Messages. A message from a node k to node i encodes

an exchange of information, or interaction, between their

corresponding shapes. The message carries information

based on the corresponding node representations h(t)i and

h(t)k and also depends on the type of relationship r between

the two nodes (e.g., a different message is computed for a

“surrounding” relationship or for a “supporting” relation-

ship). We discuss different types of relationships in Section

3.2. At step t the message m(r,t)k→i from node k to another

node i is computed as:

m(r,t)k→i = f (r)

msg(h(t)k ,h

(t)i ;w(r)

msg) (2)

where f(r)msg is a two-layer MLP with learnable parameters

w(r)msg (weights are different per relationship r) outputting a

100-dimensional message representation.

Weights on messages. Some messages might be more

important, or more relevant for prediction, than others.

…………

… …… …

𝒙𝒙𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑡𝑡

𝒙𝒙𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑡𝑡𝒙𝒙𝑛𝑛𝑐𝑐𝑛𝑛𝑐𝑡𝑡𝑑𝑑𝑡𝑡𝑐𝑐𝑛𝑛𝑑𝑑𝑡𝑡𝒉𝒉𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑡𝑡𝒉𝒉𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑡𝑡𝒉𝒉𝑛𝑛𝑐𝑐𝑛𝑛𝑐𝑡𝑡𝑑𝑑𝑡𝑡𝑐𝑐𝑛𝑛𝑑𝑑𝑡𝑡

𝒙𝒙𝑏𝑏𝑑𝑑𝑑𝑑𝑡𝑡𝒙𝒙𝑤𝑤𝑐𝑐𝑤𝑤𝑤𝑤𝑡𝑡

𝒉𝒉𝑏𝑏𝑑𝑑𝑑𝑑𝑡𝑡𝒉𝒉𝑤𝑤𝑐𝑐𝑤𝑤𝑤𝑤𝑡𝑡

MLP layers

𝑎𝑎𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑,𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑎𝑎𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑,𝑛𝑛𝑐𝑐𝑛𝑛𝑐𝑡𝑡𝑑𝑑𝑡𝑡𝑐𝑐𝑛𝑛𝑑𝑑

𝑎𝑎𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑,𝑏𝑏𝑑𝑑𝑑𝑑𝑎𝑎𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑,𝑤𝑤𝑐𝑐𝑤𝑤𝑤𝑤

𝒎𝒎𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐𝑐→𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑡𝑡𝒎𝒎𝑛𝑛𝑐𝑐𝑛𝑛𝑐𝑡𝑡𝑑𝑑𝑡𝑡𝑐𝑐𝑛𝑛𝑑𝑑→𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑡𝑡

𝒎𝒎𝑏𝑏𝑑𝑑𝑑𝑑→𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑡𝑡𝒎𝒎𝑤𝑤𝑐𝑐𝑤𝑤𝑤𝑤→𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑡𝑡

𝒉𝒉𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑡𝑡

Concatenate

aggregated messages

Weighted messagesNode representation

Updated

representation

𝒉𝒉𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑑𝑡𝑡+1

Co-occur

relation

Next-to

relation

relations

.. ……

Figure 5. Overview of our message passing and underlying neural network architecture. We take the example in Figure 4 to illustrate a

single message passing iteration.

Thus, we implemented a form of attention mechanism for

messages. During message passing, each message m(r,t)k→i is

scaled (multiplied) with a scalar weight ak,i computed as

follows:

ak,i = fatt(xk,xi;watt) (3)

where fatt is a two-layer MLP followed by a sigmoid layer

with learnable parameters watt. Note that the attention

weights are computed from the raw shape representations.

We practically found that this strategy had more stable be-

havior (better convergence) compared to updating them us-

ing both the shape and latent representations. Furthermore,

weights that are almost 0 imply that the interaction between

two nodes is negligible. This can be used in turn to dis-

card edges and accelerate message passing at test time with-

out sacrificing performance. Note also that the shape rep-

resentations include positions of objects, thus weights are

expected to correlate with object distances.

Message aggregation. Messages are aggregated

through GRU modules learned for each type of relationship.

Specifically, for each node i, we pass a sequence of mes-

sages {m(r,t)k→i}k∈N (i) coming from all the different nodes

k ∈ N (i) connected to it through a series of GRU units

(where N (i) represents the set of nodes emitting to node i).

Each GRU unit receives as input a message from an emit-

ting node k and the previous GRU unit in the sequence, and

produces an aggregated message g(r,t)i,j (j = 1, ..., |N(i)| is

an index for each GRU unit) as follows:

g(r,t)i,j = f

(r)GRU (g

(r,t)i,j−1, ak,i ·m

(r,t)k→i;w

(r)GRU ) (4)

where w(r)GRU are learnable GRU parameters per type of

relationship r. The last GRU unit in the sequence pro-

duces the final aggregated message, denoted simply as g(r,t)i

(dropping the last index j = |N (i)| in the GRU sequence

for clarity). We note that the first GRU unit in the sequence

receives an all-zero message. The order of messages also

matters, as generally observed in recurrent networks. In our

implementation, the messages from emitting nodes to the

node i are ordered according to the Euclidean distance of

their corresponding object centroid to the centroid of the

object represented at node i (from furthest to closest).

Node representation update. Finally, for each node i,

its latent representation is updated to reflect the transmitted

context captured in all messages sent to it across all differ-

ent types of relationships. Specifically, the aggregated mes-

sages g(r,t)i from all relationship types are concatenated,

and its latent representation is updated as follows:

h(t+1)i = fupd(h

(t)i , Concat({gr,t

i }r∈R);wupd) (5)

where fupd is a two-layer MLP with learnable weights

wupd, and R is the set of relationships that we discuss in

the following section.

3.2. Graph structure and Object Relationships

A crucial component in our method is the underlying

graph structure used for message passing. One obvious

strategy is to connect each node (object) in a scene with

neighboring nodes (objects) based e.g., on Euclidean dis-

tance between them. However, we found that connecting

the graph via simple neighboring relationships resulted in

poor object prediction performance for a number of reasons.

First, it is hard to define a global distance threshold or single

number of nearest neighbors that work well for all different

scenes. Second, and most importantly, such “neighboring”

relationships are coarse and often ambiguous. For exam-

ple, a pillow and a nightstand are both “nearby” a bed, yet

their structural and spatial relationships wrt the bed can be

quite different i.e., the pillow is “on top of” the bed, while a

nightstand is “adjacent” or “next to” the bed. We found that

representing more fine-grained relationships in the graph

significantly increases the prediction performance of our

method. We also found that forming dense graphs captur-

ing long-range interactions between scene objects was bet-

ter than using sparse graphs or constraining them to be tree-

structured, as discussed in our results section. Below we

discuss the different types of relationships we used to form

connections between nodes for message passing.

“Supporting” relationship. A node i is connected to a

node k via a directed edge of this relationship if the object

represented by node i supports, or is “on top of”, the object

at node k. The relationship can be detected by examining

the bounding boxes of the two objects (details are provided

in the supplementary material).

“Supported-by” relationship. This is the opposite rela-

tionship to the previous one i.e., a node i is connected to a

node k via an edge of this relationship if the object at node

i is supported by node k. Note that a “supporting” relation-

ship i → k implies a “supported-by’ relationship k → i.

Yet, we use different weights for these relationships i.e., the

node i sends a message to k learned through a MLP whose

weights are different from the MLP used to infer the mes-

sage from k to i. Using such assymetric messages improved

performance compared to using symmetric ones, since in

this manner we capture the directionality of this relation-

ship. Having exclusively “supporting” relationships (and

not vice versa) also resulted in lower performance.

“Surrounding” relationship. If there is a set of objects

surrounding another object i.e., the set has objects of same

size whose bounding boxes are placed under a reflective or

rotational symmetry around a central object, then all these

objects are connected to the central object via directed edges

of the “surrounding” relationship type (details are provided

in the supplementary material).

“Surrounded-by” relationship. This is the opposite re-

lationship to the previous one. A central object is con-

nected to the objects surrounding it via directed edges of

‘surrounded-by” relationship type.

‘Next-to” relationship. Two nodes i and k are connected

via an undirected edge of “next-to” relationship type if the

node i is adjacent to node k and lie on the same underlying

supporting surface. Note that in contrast to the previous re-

lationships, this is a symmetric relationship, thus messages

are inferred by the same MLP in both directions.

“Co-occuring” relationship. Two nodes i and k are con-

nected via an undirected edge of “co-occuring” relationship

type if their objects co-exist in the same scene. This type

of relationship obviously results in a fully connected graph.

As discussed earlier in the introduction, the selection of ob-

jects to place in a location may be strongly influenced by

other objects further away from that location. We found

that capturing such long-range, “object co-occurence” inter-

actions, in combination with our attention mechanism that

learns to weigh them, improved the prediction performance

of our method, even if this happens at the expense of more

computation time. We note that edges can be dynamically

discarded during message passing by examining the atten-

tion weights in the graph, thus accelerating the execution at

test time. For scenes containing 50-100 objects, message

passing in our graph takes a few seconds at test time.

Figure 4 illustrates our scene graph structure with all the

different types of relationships for a toy scene example. We

note that in all our training and test scenes, there is a “floor”

node and “wall” nodes, since these objects are also con-

stituent parts of indoor scenes and are useful for making

predictions (i.e., when we want to predict an object hanging

on the wall).

3.3. Prediction

Given a query location in a form of a point p in the scene,

we form a special “empty” node m in our graph represent-

ing the “missing object” to predict. The node is initialized

to an all-zero shape category and size representation vector,

and 3D position set to the p. We connect it to other nodes in

the graph based on the relationships discussed in the previ-

ous section (including our directed relationships which can

be inferred by examining the relative position of the query

point wrt other objects in the scene). This special node

along with its edges form our final scene graph s based on

the input scene and query.

We then execute message passing in our graph. The

node representations are updated at each time step syn-

chronously (i.e., all messages are computed before they are

sent to other nodes), including the representation h(t)m of our

special node. Message passing is executed up to t = T

time steps (practically T = 3 iterations in our implemen-

tation). Then the representation in our special node is de-

coded through a two-layer MLP and a softmax to predict

our target probability distribution:

P (c|p, s) = fpred(h(T )m ;wpred) (6)

where fpred represents the MLP and wpred its learnable pa-

rameters. For interactive modeling tasks, we also found use-

ful to predict the size dm of the object to place in the scene.

This is done through one more MLP regressing to object

size from the learned node representation.

Training. All the MLPs in our network, including the

GRU modules, are trained jointly. To train our graph net-

work, given a training scene, we remove a random object

from it (excluding “walls” and “floor”). Our training aims

Indoor incomplete scene

Top-down view

Location probability

Complete scene

Side view Sofa

Figure 6. Prediction of most likely object categories to add in

the scene and associated placement distributions. Left: An in-

put scene. Middle: the top-two most plausible categories to add

in the scene (sofa and chair) along with the evaluated placement

probability at each scene location. Right: Resulting scene with a

sofa and chair placed at the most likely location.

to predict correctly the category of the removed object. This

means that we place a query in its location, and execute our

message passing interactive procedure. Based on the pre-

dicted distribution for the corresponding empty node, we

form a categorical cross entropy loss for the missing ob-

ject’s category. To train the MLP regressing to the object

size, we additionally use the L2 loss between the ground-

truth size and the predicted one.

Implementation details. Training is done through the

Adam optimizer [11] with learning rate 0.001, beta coef-

ficients are (0.9, 0.999) and weight decay is set to 10−5.

The batch size is set to 350 scenes. The model converges

around 8K iterations for our largest room dataset. We pick

the best model and hyper-parameters based on hold-out val-

idation performance. Our implementation is in Pytorch

and is provided at: https://github.com/yzhou359/

3DIndoor-SceneGraphNet .

3.4. Applications

Object recognition in scenes. Objects in 3D scenes may

not be always tagged based on their category. One approach

to automatically recognize 3D objects is to process them in-

dividually through standard 3D shape processing architec-

tures operating on a volumetric (e.g. [22]), multi-view (e.g.,

[19]), or point-based representations (e.g., [14]). However,

such approaches are not perfect and are susceptible to mis-

takes, especially if there are objects whose shapes are sim-

ilar across different categories. To improve object recogni-

tion in scenes, we can also use the contextual predictions

from SceneGraphNet. Specifically, given a posterior distri-

bution P (C|o) for an object extracted by one of the above

3D deep architectures given its raw shape representation o

(voxels, multi-view images, or points), and our posterior

distribution extracted from the node at the location of the

object P (C|p, s), we can simply take the product of these

two distributions and re-normalize. In our experiments,

we used a popular multi-view architecture [19], and found

that the resulting distribution yields better predictions com-

pared to using multi-view object predictions alone. Figure

2 demonstrates a characteristic example.

Incremental scene synthesis. Our posterior distribution

is conditioned at a query location to evaluate the fitness of

an object category at that location. We can use our dis-

tribution to incrementally synthesize a scene by evaluating

P (C|p, s) over a grid of query locations p in the scene (e.g.,

a 2D regular grid of locations on the floor, or the surface of

a randomly picked object in a scene, such as desk), pick-

ing the location and object category that maximizes the dis-

tribution, then retrieving a 3D model from a database that

matches the predicted size at the location (we note that we

assume a uniform prior distribution over locations here).

Figure 6 shows examples of the most likely object predic-

tion and the evaluated probability distribution for placement

across the scene. Figure 3 shows an example of iterative

scene synthesis. Although some user supervision is eventu-

ally required to tune placement, specify object orientation,

and stop the iterative procedure, we believe that our method

may still provide helpful guidance for 3D scene design.

4. Results and Evaluation

We evaluated our method both qualitatively and quanti-

tavely. Below we discuss our dataset, evaluation metrics,

comparisons with alternatives, and our ablation study.

Dataset. Following [21], we experimented with four

room types (6K bedrooms, 4K living rooms, 3K bathrooms,

and 2K offices) from the SUNCG dataset. The number

of object categories varies from 31 to 51. Dataset statis-

tics are presented in the supplementary material. To ensure

fair comparisons with other methods [12, 21] that assume

rectangular rooms with four walls as input, we excluded

SUNCG rooms that do not have this layout (we note that our

method does not have this limitation). All methods were

trained for each of the four room types separately. We used

a random split of 80% of scenes for training, 10% for hold-

out validation, 10% for testing (the same splits and hyper-

parameter tuning procedure were used to train all methods).

Evaluation metrics. For the task of scene augmentation

conditioned on a 3D query point location, we used the

following procedure for evaluation. Given a test SUNCG

scene, we randomly remove one of the objects (excluding

floor and walls). Then given a query location set to cen-

troid of this object, we compute the object category pre-

diction distributions from all competing methods discussed

below. First, we measure the classification accuracy i.e.,

whether the most likely object category prediction produced

by a method agrees with the ground-truth category of the

removed object. We also evaluate top-K classification ac-

curacy, which measures whether the ground-truth category

is included in the K most probable predictions produced by

a method. The reason for using the top-K accuracy is that

some objects from one category (e.g., a laptop on a desk)

can be replaced with objects from another category (e.g., a

book) in a scene without sacrificing the scene plausibility.

Thus, if a method predicts book as the most likely category,

then laptop, the top-K accuracy (K>1) would be unaffected.

We also evaluate the accuracy of predictions for object size.

Objects are annotated with physical size in SUNCG, thus

we report the error in terms of centimeters (cm) averaged in

all three dimensions across all test queries.

Comparisons. We compare with two state-of-the-art

methods for scene synthesis: GRAINS [12] and Wang et

al.’s view-based convolutional prior [21]. Wang et al. [21]

trains a CNN module that takes as input a top-down view of

a scene and outputs object category predictions conditioned

on a 2D location in that view. To compare with Wang et al.

[21], we project our input 3D query location onto the corre-

sponding 2D location in the top-down view of the scene. We

use their publicly available code to train their module. To

compare with GRAINS [12], we first encode the input scene

as a tree based on its heuristics (again, we use their publicly

available code). For a fair comparison (i.e., provide same

input information to GRAINS as in our method), we also

include an “empty” node representing the query, connect-

ing it to rest of the tree based on the same procedure as ours.

The node has the same all-zero category and size represen-

tation as in our method, and its 3D position is set according

to the query position expressed relatively to its sibling in

the tree (GRAINS encodes relative positions of objects rel-

ative to sibling nodes). The tree is processed through their

recursive network, then decoded it to the same tree with

the goal to predict the category of the “empty” node. To

train GRAINS, we used the same loss as our method. The

same dataset and splits were used for all methods. We note

that GRAINs has 5M learnable parameters, Wang et al. has

42M , while SceneGraphNet has much fewer (1.5M ).

Results. Table 1 shows the top-K accuracy averaged over

the whole dataset across all room types for all different

methods (K = 1, 3, 5). Our method outperforms the two

other competing methods with a significant margin ranging

from 10.2% for top-5 accuracy to 16.4% for top-1 accuracy.

In the supplementary material, we include detailed evalua-

tion for each room type. Figure 8 shows the top-3 accuracy

per each room type binned according to the total number of

objects in the input scene. We observe that the performance

gap between our method and others tends to increase for

larger scenes with more objects. Figure 7 shows the most

likely predicted categories for SceneGraphNet, GRAINS

[12], and Wang et al. [21] for two input scenes and queries.

MethodAverage

Top1 Top3 Top5

GRAINs[12] 44.2 63.9 73.6

Wang et al. [21] 50.9 70.7 79.3

SGNet-tree 61.1 79.7 87.0

SGNet-sparse 60.0 78.6 86.5

SGNet-co-occur 56.5 75.4 83.3

SGNet-sum 57.7 77.6 85.1

SGNet-max 63.1 81.5 87.3

SGNet-vanilla-rnn 64.8 82.3 88.5

SGNet-no-attention 60.3 79.1 85.8

SGNet-dist-weights 63.8 81.6 87.9

SceneGraphNet (full model) 67.3 83.8 89.5

Table 1. Top-K accuracy for our scene augmentation task for

different methods and variants of SceneGraphNet.

In terms of object size, we compare our method with

GRAINS, since GRAINS is also able to predict the oriented

bounding box size of the object to add in the scene. We

found that GRAINS has 38cm average error in size predic-

tion, while our method results in much lower error: 26cm.

Ablation Study. We also evaluated our method against

several other degraded variants of it based on the same

dataset and classification evaluation metrics (Table 1). We

tested the following variants. SGNet-tree: we build the

scene graph using only supporting, supported-by, surround-

ing, and surrounded-by relationships such that the result-

ing graph has guaranteed tree structure (we break cycles,

if any, by prioritizing supporting relationships). SGNet-

sparse builds the graph with all the relationships except for

the dense “co-occurring” ones, resulting in a sparse graph

structure. SGNet-co-occur: we build a fully connected

graph with only the “co-occurring” relationships between

edges. SGNet-sum uses the full graph structure, yet ag-

gregates messages using a summation over their represen-

tations instead of a GRU module. SGNet-max instead ag-

gregates messages using max-pooling over their represen-

tations. SGNet-vanilla-rnn instead aggregates messages

using a vanilla RNN. SGNet-no-attention uses the full

graph structure, yet does not use the weighting mechanism

on messages (all messages have same weight). SGNet-

dist-weights instead computes weights on messages by set-

ting them according to Euclidean distance between nodes

(weights are set as αk,i = c · exp−||dk,i||

b , where dk,i is the

Euclidean distance between object centroids k and i, c and

b are parameters set through hold-out validation). We found

that our method outperforms all these degraded variants.

Performance wrt iteration number. After completing

the first iteration of message passing, the top-5 accuracy is

72.7%. In the second iteration, it increases to 88.2%, and

in the third one, the performance arrives at 89.5%. The per-

formance oscillates around 89% after the third iteration.

window

dresser

picture

switch

hanger

shelvesQuery location

3D Scene Ours GRAINs

Bathroom

Bedroom

Query location

Wang et al.

shelves

mirror

switch

partition

person

curtain

shelves

bathtub

indoor lamp

shelves

tower hanger

toilet

toilet paper

shelves

Figure 7. Comparison of object category predictions for two 3D scenes and query positions (red points) across different methods. Given

the input scenes and queries (left), we show the predicted category distributions and rendered scenes with objects added from the most

likely predicted category per each method.

Method Bed Living Bath Office Avg

MVCNN 69.6 55.8 43.4 67.8 59.2

MVCNN+ Ours 79.9 74.7 56.4 73.0 72.2

Table 2. Object recognition accuracy using MVCNN alone [19]

and using the MVCNN together with our ShapeNetGraph poste-

rior evaluated for the objects in our test scenes.

Object recognition in 3D scenes. For each object in our

scene dataset, we use a MVCNN [19] to predict its ob-

ject category P (C|o) given its multi-view representation

o. Then we multiply this predicted distribution with the

context-aware predicted posterior P (C|p, s) from Scene-

GraphNet. Table 2 shows the average classification ac-

curacy for object recognition with and without combining

the SceneGraphNet’s posterior with the MVCNN one. Our

method significantly improves the accuracy in this context-

based object recognition task by 13%.

Timings. It takes 30 hours to train our network in our

largest room dataset (bedrooms) with 5K training scenes

measured on a GeForce 1080Ti GPU. At test time, given

an average-sized scene with 40 objects, our method takes

around 0.58 sec to infer the distribution given the query.

5. Discussion

We presented a neural message passing method that op-

erates on a dense scene graph representation of a 3D scene

to perform scene augmentation and context-based object

recognition. There are several avenues for future work. Our

method is currently limited to predicting object categories

Room size: number of objects in a single room

GRAINs Wang et al. Ours

100.0%

<20 20‐30 30‐40 40‐50 >500.0%

100.0%

<20 20‐30 30‐40 40‐50 >50

100.0%

<20 20‐30 30‐40 40‐50 >50

Office Top‐3 Accuracy

100.0%

<20 20‐30 30‐40

Bathroom Top‐3 Accuracy

Bedroom Top‐3 Accuracy Living room Top‐3 Accuracy

Figure 8. Average top-3 classification accuracy for rooms binned

according to number of objects they contain.

at a given location. It could be extended to instead generate

objects and whole scenes from scratch. Our message pass-

ing update scheme is not guaranteed to converge, nor we

forced it to be a contraction map. Oscillations can happen

as in other message passing algorithms e.g., loopy belief

propagation. Training and testing with the same number of

iterations helps preventing unstable behavior. It would be

fruitful to investigate modifications that would result in any

theoretical convergence guarantees. Enriching the set of re-

lationships in our scene graph representation could also help

improving the performance. Another limitation is that our

method currently uses the coarse-grained labels provided in

SUNCG, as other methods did. Fine-grained and hierarchi-

cal classification are also interesting future directions.

Acknowledgements. This research is funded by NSF

(CHS-161733). Our experiments were performed in the

UMass GPU cluster obtained under the Collaborative Fund

managed by the Massachusetts Technology Collaborative.

References

[1] Peter W Battaglia, Razvan Pascanu, Matthew Lai, Danilo

Rezende, and Koray Kavukcuoglu. Interaction networks for

learning about objects, relations and physics. In Advances in

Neural Information Processing Systems, NIPS, 2016. 1, 2

[2] Kang Chen, Yu-Kun Lai, Yu-Xin Wu, Ralph Martin, and Shi-

Min Hu. Automatic semantic modeling of indoor scenes

from low-quality rgb-d data using contextual information.

ACM Trans. Graph., 33(6), 2014. 2

[3] Matthew Fisher and Pat Hanrahan. Context-based search for

3d models. ACM Trans. Graph., 29(6), 2010. 2

[4] Matthew Fisher, Daniel Ritchie, Manolis Savva, Thomas

Funkhouser, and Pat Hanrahan. Example-based synthesis of

3d object arrangements. ACM Trans. Graph., 31(6), 2012. 2

[5] Matthew Fisher, Manolis Savva, and Pat Hanrahan. Char-

acterizing structural relationships in scenes using graph ker-

nels. ACM Trans. Graph., 30(4), 2011. 2

[6] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol

Vinyals, and George E Dahl. Neural message passing for

quantum chemistry. In International Conference on Machine

Learning, ICML, 2017. 1, 2

[7] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive

representation learning on large graphs. In Advances in Neu-

ral Information Processing Systems, NIPS, 2017. 1, 2

[8] William L. Hamilton, Rex Ying, and Jure Leskovec. Rep-

resentation learning on graphs: Methods and applications.

IEEE Data Eng. Bull., 40(3), 2017. 2

[9] Paul Henderson and Vittorio Ferrari. A generative model

of 3d object layouts in apartments. CoRR, abs/1711.10939,

2017. 2

[10] Zeinab Sadeghipour Kermani, Zicheng Liao, Ping Tan, and

Hao (Richard) Zhang. Learning 3d scene synthesis from an-

notated RGB-D images. Computer Graph. Forum, 35(5),

2016. 2

[11] Diederik P. Kingma and Jimmy Ba. Adam: A method for

stochastic optimization. CoRR, abs/1412.6980, 2014. 6

[12] Manyi Li, Akshay Gadi Patil, Kai Xu, Siddhartha Chaudhuri,

Owais Khan, Ariel Shamir, Changhe Tu, Baoquan Chen,

Daniel Cohen-Or, and Hao Zhang. Grains: Generative re-

cursive autoencoders for indoor scenes. ACM Trans. Graph.,

38(2), 2019. 2, 6, 7

[13] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard

Zemel. Gated graph sequence neural networks. International

Conference on Learning Representations, ICLR, 2015. 2

[14] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas.

Pointnet: Deep learning on point sets for 3d classification

and segmentation. In Conference on Computer Vision and

Pattern Recognition, CVPR, 2017. 6

[15] Daniel Ritchie, Kai Wang, and Yu-An Lin. Fast and flex-

ible indoor scene synthesis via deep convolutional genera-

tive models. In Conference on Computer Vision and Pattern

Recognition, CVPR, 2019. 2

[16] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Ha-

genbuchner, and Gabriele Monfardini. The graph neural net-

work model. IEEE Trans. on Neural Networks, 20(1), 2009.

[17] Kristof T Schutt, Farhad Arbabzadah, Stefan Chmiela,

Klaus R Muller, and Alexandre Tkatchenko. Quantum-

chemical insights from deep tensor neural networks. Nature

Communications, 8, 2017. 2

[18] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Mano-

lis Savva, and Thomas Funkhouser. Semantic scene comple-

tion from a single depth image. In Conference on Computer

Vision and Pattern Recognition, CVPR, 2017. 2

[19] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and

Erik G. Learned-Miller. Multi-view convolutional neural

networks for 3d shape recognition. In International Con-

ference on Computer Vision, ICCV, 2015. 2, 6, 8

[20] Kai Wang, Yu-An Lin, Ben Weissmann, Manolis Savva, An-

gel X. Chang, and Daniel Ritchie. Planit: Planning and in-

stantiating indoor scenes with relation graph and spatial prior

networks. ACM Trans. Graph., 38(4), 2019. 2

[21] Kai Wang, Manolis Savva, Angel X. Chang, and Daniel

Ritchie. Deep convolutional priors for indoor scene synthe-

sis. ACM Trans. Graph., 37(4), 2018. 2, 6, 7

[22] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Lin-

guang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d

shapenets: A deep representation for volumetric shapes. In

Conference on Computer Vision and Pattern Recognition,

CVPR, 2015. 6

[23] Kun Xu, Kang Chen, Hongbo Fu, Wei-Lun Sun, and Shi-

Min Hu. Sketch2scene: sketch-based co-retrieval and co-

placement of 3d models. ACM Trans. Graph., 32(4), 2013.

[24] Zaiwei Zhang, Zhenpei Yang, Chongyang Ma, Linjie Luo,

Alexander Huth, Etienne Vouga, and Qixing Huang. Deep

generative modeling for scene synthesis via hybrid represen-

tations. ACM Trans. Graphics, to appear, 2019. 2

SceneGraphNet: Neural Message Passing for 3D Indoor Scene ...

Documents