
Volumetric Instance-Aware Semantic Mapping and 3D Object Discovery

Margarita Grinvald, Fadri Furrer, Tonci Novkovic, Jen Jen Chung, Cesar Cadena, Roland Siegwart, Juan Nieto

Abstract— To autonomously navigate and plan interactions in real-world environments, robots require the ability to robustly perceive and map complex, unstructured surrounding scenes. Besides building an internal representation of the observed scene geometry, the key insight towards a truly functional understanding of the environment is the usage of higher-level entities during mapping, such as individual object instances. We propose an approach to incrementally build volumetric object-centric maps during online scanning with a localized RGB-D camera. First, a per-frame segmentation scheme combines an unsupervised geometric approach with instance-aware semantic object predictions. This allows us to detect and segment elements both from the set of known classes and from other, previously unseen categories. Next, a data association step tracks the predicted instances across the different frames. Finally, a map integration strategy fuses information about their 3D shape, location, and, if available, semantic class into a global volume. Evaluation on a publicly available dataset shows that the proposed approach for building instance-level semantic maps is competitive with state-of-the-art methods, while additionally able to discover objects of unseen categories. The system is further evaluated within a real-world robotic mapping setup, for which qualitative results highlight the online nature of the method.

I. INTRODUCTION

Robots operating autonomously in unstructured, real-world environments cannot rely on a detailed a priori map of their surroundings for planning interactions with scene elements. They must therefore be able to robustly perceive the complex surrounding space and acquire task-relevant knowledge to guide subsequent actions. Specifically, to learn accurate 3D object models for tasks such as grasping and manipulation, a robotic vision system should be able to discover, segment, track, and reconstruct objects at the level of the individual instances. However, real-world scenarios exhibit large variability in object appearance, shape, placement, and location, posing a direct challenge to robotic perception [1]. Further, such settings are usually characterized by open-set conditions, i.e. the robot will inevitably encounter novel objects of previously unseen categories.

Computer vision algorithms have shown impressive results for the tasks of detecting individual objects in RGB images and predicting for each a per-pixel semantically annotated mask [2], [3]. However, these methods alone do not provide a 3D representation of the scene, and, therefore, cannot be directly used for robot navigation or manipulation planning.

This work was supported in part by ABB Corporate Research and in part by the Swiss National Science Foundation (SNF) through the National Centre of Competence in Research on Digital Fabrication.

The authors are with the Autonomous Systems Lab, ETH Zurich, 8092 Zurich, Switzerland (e-mail: {mgrinvald, fadri, ntonci, chungj, cesarc, rsiegwart, nietoj}@ethz.ch).

Fig. 1: Reconstruction and object-level segmentation of an office scene using the proposed approach. Panels: (a) Object-centric Map, (b) Ground Truth Instance Map, (c) Semantic Instance Segmentation, (d) Geometric Segmentation [4]. Color legend: Monitor, Keyboard, Suitcase, Table, Chair, Mouse, Refrigerator, Plant, Backpack, Cup, Microwave, Unknown. Besides accurately describing the observed surface geometry, the final object-centric map in Figure (a) carries information about the location and dense 3D shape of the individual object instances in the scene. As opposed to a geometry-only segmentation from our previous work [4] shown in Figure (d), recognized objects are segmented as one instance despite their non-convex shape (blue circle) and assigned a semantic category shown in Figure (c). At the same time, the proposed approach discovers novel, previously unseen elements of unknown class (red circle). Note that different colors in Figure (a) and Figure (b) represent the different instances, and that a same instance in the prediction and ground truth is not necessarily of the same color. The scene is reconstructed from sequence 231 of the SceneNN [10] dataset. The accompanying video available at http://youtu.be/Jvl42VJmYxg shows the progressive mapping results.

On the other hand, dense 3D scene reconstruction has been extensively studied by the robotics community. A number of works extend the task to detecting and segmenting individual object instances in the built map without any prior knowledge about their exact appearance [4]–[9]. Recent learning-based approaches can locate semantically meaningful objects in reconstructed scenes while dealing with substantial intra-class variability [6]–[9]. Still, these methods only detect objects from a fixed set of classes used during training, thus limiting interaction planning to a subset of the observed elements. In contrast, purely geometry-based methods [4], [5] are able to discover novel, previously unseen scene elements under open-set conditions. However, such approaches tend to over-segment the reconstructed objects and additionally fail to provide any semantic information about them, making high-level scene understanding and task planning impractical.



This paper presents an approach to incrementally build geometrically accurate volumetric maps of the environment that additionally contain information about the individual object instances observed in the scene. In particular, the proposed object-oriented mapping framework retrieves the dense shape and pose of recognized semantic objects, as well as of newly discovered, previously unobserved object-like instances. The proposed system builds on top of the incremental geometry-based scene segmentation approach from our previous work in [4] and extends it to produce a complete instance-aware semantic mapping framework. Figure 1 shows a sample object-centric map of an office scene reconstructed with the proposed approach.

The system takes as input the RGB-D stream of a depth camera with known pose.1 First, a frame-wise segmentation scheme combines an unsupervised geometric segmentation of depth images [5] with semantic object predictions from RGB [2]. The use of semantics makes it possible to infer the category of some of the 3D segments predicted in a frame, as well as to group segments by the object instance they belong to. Next, the tracking of the individual predicted instances across multiple frames is addressed by matching per-frame predictions to existing segments in the global map via a data association strategy. Finally, observed surface geometry and segmentation information are integrated into a global Truncated Signed Distance Field (TSDF) map volume. To this end, the Voxblox volumetric mapping framework [11] is extended to enable the incremental fusion of class and instance information within the reconstruction. By relying on a volumetric representation that explicitly models free space information, i.e. distinguishes between unknown space and observed, empty space, the built maps can be directly used for safe robotic navigation and motion planning purposes. Furthermore, object models reconstructed with the voxel grid explicitly encode surface connectivity information, relevant in the context of robotic manipulation applications.
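As a rough illustration of this representation, the sketch below shows how a TSDF voxel can be augmented with a persistent segment label; the field names are hypothetical and do not correspond to the actual Voxblox code.

```python
from dataclasses import dataclass

# A minimal sketch, assuming hypothetical field names, of a TSDF voxel
# extended with segmentation information in the spirit of the described
# Voxblox extension; not the actual Voxblox/paper data structure.
@dataclass
class SegmentedTsdfVoxel:
    distance: float = 0.0   # truncated signed distance to the nearest surface
    weight: float = 0.0     # integration confidence; 0 marks unobserved space
    segment_label: int = 0  # persistent geometric segment label l_j (0 = none)

# A voxel with weight > 0 encodes observed space (empty if the stored
# distance reaches the truncation band), while weight == 0 encodes unknown
# space; this is the free-space distinction discussed above.
```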

The capabilities of the proposed method are demonstrated in two experimental settings. First, the proposed instance-aware semantic mapping framework is evaluated on office sequences from the real-world SceneNN [10] dataset to compare against previous work on progressive instance segmentation of 3D scenes. Second, we show qualitative results for an online mapping scenario on a robotic platform. The experiments highlight the robustness of the presented incremental segmentation strategy and the online nature of the framework.

The main contributions of this work are:

• A combined geometric-semantic segmentation scheme that extends object detection to novel, previously unseen categories.

• A data association strategy for tracking and matching instance predictions across multiple frames.

• Evaluation of the framework on a publicly available dataset and within an online robotic mapping setup.

1 Please note that the current work focuses entirely on mapping, hence localization of the camera is assumed to be given.

II. RELATED WORK

A. Object detection and segmentation

In the context of object recognition in real-world environments, computer vision algorithms have recently shown some impressive results. Such success is driven by the advances in deep learning using Convolutional Neural Networks (CNNs). Several architectures have been proposed for the tasks of detecting objects in RGB images and predicting for each one a bounding box or a semantically annotated segmentation mask. State-of-the-art bounding box regression systems include single-stage approaches such as SSD [12] and YOLOv2 [13], and double-stage methods like the region-based Faster R-CNN [14]. Moving from bounding boxes to instance-level mask proposals, the recent Mask R-CNN framework [2] adopts the same architecture as in [14] and extends it with a branch for predicting per-pixel semantically annotated masks for each of the detected object instances. Mask R-CNN achieves state-of-the-art results on the COCO instance-level semantic segmentation task [15].

One of the major limitations of deep learning based instance segmentation methods is that they require extensive amounts of training data in the form of annotated masks for the specified object categories. Such annotated data can be expensive or even infeasible to acquire for all possible object categories that may be encountered in a real-world scenario. Moreover, these algorithms can only handle objects from the fixed set of classes provided during training, thus failing to correctly segment and classify other, previously unseen object categories.

Some recent works aim to relax the requirement for large amounts of pixel-wise semantically annotated training data. MaskX R-CNN [16] adopts a transfer method which only requires a subset of the data to be labeled at training time. SceneCut [17] and its Bayesian extension in [3] also operate under open-set conditions and are able to detect and segment novel objects of unknown classes. However, beyond detecting object instances in individual image frames, these methods alone do not provide a comprehensive 3D representation of the scene and, therefore, cannot be directly used for planning tasks such as manipulation or navigation.

B. Semantic object-level mapping

Recent developments in deep learning have also enabled the integration of rich semantic information within real-time Simultaneous Localization and Mapping (SLAM) systems. The work in [18] fuses semantic predictions from a CNN on RGB-D image pairs into a dense map built with a SLAM framework. However, conventional semantic segmentation is unaware of object instances, i.e. it does not disambiguate between individual instances that belong to the same category. Thus, the approach in [18] does not provide any information about the geometry and relative placement of the individual objects in the scene. Similar work in [19] additionally offers a strategy to incrementally segment the 3D reconstruction using geometric cues from the depth images. However, geometry-based approaches tend to result in over-segmentation of articulated scene elements. Thus, without instance-level information, a joint semantic and geometric segmentation is not enough to group individually categorized parts of the scene into distinct separate objects. Indeed, the instance-agnostic semantic segmentation in these works fails to build semantically meaningful maps that model the individual instances present in the scene.

Previous work has addressed the task of mapping at the level of individual objects. SLAM++ [20] builds object-oriented maps by detecting recognized elements in RGB-D data, but is limited to work with a database of objects for which exact geometric models need to be known in advance. A number of other works have addressed the task of detecting and segmenting individual semantically meaningful objects in 3D scenes without predefined shape templates [4]–[9]. Recent learning-based approaches can segment individual instances of semantically annotated objects in reconstructed scenes with little or no prior information about their exact appearance, while at the same time dealing with substantial intra-class variability [6]–[9]. However, by relying on a strong supervisory signal of the predefined classes during training, a purely learning-based object segmentation fails to discover novel objects of unknown class during mapping. As a result, these methods either fail to map objects that do not belong to the set of known categories and for which no semantic labels are predicted [6], [7], [9], or wrongly assign such previously unseen instances to one of the known classes [8]. In a real-world robotic interaction scenario, detecting objects only from a fixed set of classes specified during training limits interaction planning to a subset of all the observed scene elements.

In contrast, the purely geometry-based methods in [4], [5] operate under open-set conditions and are able to discover novel, previously unobserved objects in the scene. The work in [5] provides a complete and exhaustive geometric segmentation of the scene. Similarly, the Incremental Object Database (IODB) in [4] performs a purely geometric segmentation from depth data to reconstruct the shape and location of individual segments and build a consistent database of unique 3D object models. However, as mentioned previously, geometry-based approaches can result in unwanted over-segmentation of non-convex objects. Furthermore, by not providing semantic information, the two methods disallow high-level interaction planning. In addition to a complete geometric segmentation of the scene, the work in [21] performs object recognition on such segments from a database of known objects. While able to discover new, previously unseen objects and to provide some semantic information, its main drawback lies in the requirement for exact 3D geometric models of the recognized objects to be known. This is not applicable to real-world environments, where objects with novel shape variations are inevitably encountered on a regular basis.

Closely related to the approach presented in this paper is the recent work in [22], with the similar aim of building dense object-oriented semantic 3D maps. The work presents an incremental geometry-based segmentation strategy, coupled with the YOLO v2 [13] fast bounding box object detector to identify and merge geometric segments that are detected as part of the same instance. One of the key differences to our approach is the choice of scene representation. Their system relies on the dense RGB-D SLAM system from [23] and stores the reconstructed 3D map using the surfel-based approach proposed in [24]. While surfels allow for efficient handling of loop closures, they only store the surface of the environment and do not explicitly represent observed free space [25]. That is, a surfel-based map does not distinguish between unseen and seen-but-empty space, and thus cannot be directly used for planning of robotic navigation or manipulation tasks where knowledge about free space is essential for safe operation [26]. Furthermore, visibility determination and collision detection in surfel scenes can be significantly harder because of the lack of surface connectivity information. Therefore, as with all other approaches relying on sparse point or surfel cloud representations [6], [7], the object-oriented maps built in [22] cannot be immediately used in those robotic settings where an explicit distinction between unobserved space and free space is required.

Conversely, the volumetric TSDF-based representation adopted in this work does not discard valuable free space information and explicitly distinguishes observed empty space from unknown space in the 3D map. In contrast to all previous approaches, the proposed method is able to incrementally provide densely reconstructed volumetric maps of the environment that contain shape and pose information about both recognized and unknown object elements in the scene. The reconstructed maps are expected to directly benefit navigation and interaction planning applications.

III. METHOD

The approach proposed here for incremental object-centric mapping consists of four steps deployed at every incoming RGB-D frame: (i) geometric segmentation, (ii) semantic instance-aware segmentation refinement, (iii) data association, and (iv) map integration. First, the incoming depth map is segmented according to a convexity-based geometric approach that yields segment contours which accurately describe real-world physical boundaries (Section III-A). The corresponding RGB frame is processed with the Mask R-CNN framework to detect object instances and compute for each of these a per-pixel semantically annotated segmentation mask. The per-instance masks are used to semantically label the corresponding depth segments and to merge segments detected as belonging to the same geometrically over-segmented, non-convex object instance (Section III-B). A data association strategy matches segments discovered in the current frame and their comprising instances to the ones already stored in the map (Section III-C). Finally, segments are integrated into the dense 3D map, where a fusion strategy keeps track of the individual segments discovered in the scene (Section III-D). An example illustrating the individual stages of the proposed approach is shown in Figure 2.


Fig. 2: The individual stages of the proposed approach for incremental object-level mapping are illustrated here with an example. At each new frame, the incoming RGB image is processed with the Mask R-CNN network to detect object instances and predict for each a semantically annotated mask. At the same time, a geometric segmentation decomposes the depth image into a set of convex 3D segments. The predicted semantic masks are used to infer class information for the corresponding depth segments and to refine over-segmentation of non-convex objects by grouping segments by the object instance they belong to. Next, a data association strategy matches segments predicted in the current frame to their corresponding instance in the global map to retrieve for each a map-consistent label. Finally, dense geometry and segmentation information from the current frame are integrated into the global map volume.

A. Geometric segmentation

Building on the assumption that real-world objects exhibit overall convex surface geometries, each incoming depth frame is decomposed into a set of object-like convex 3D segments following the geometry-based approach introduced in [4]. At every frame t, surface convexity and the 3D distance between adjacent depth map vertices are combined to generate a set R_t of closed 2D regions r_i in the current depth image and a set S_t of corresponding 3D segments s_i. Figure 2 shows the sample output of this stage.
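As a loose illustration of such a convexity cue (the exact criterion and thresholds of [4] may differ), adjacent depth-map vertices can be tested for local convexity as in the following sketch:

```python
import numpy as np

def locally_convex(p1, n1, p2, n2, tol_deg=8.0):
    """Illustrative local convexity test between two adjacent depth-map
    vertices p1, p2 with unit surface normals n1, n2. This only conveys
    the idea that segment boundaries are placed where the surface bends
    concavely; it is not the exact criterion of [4].
    """
    d = (p2 - p1) / np.linalg.norm(p2 - p1)
    tol = np.sin(np.deg2rad(tol_deg))
    # Convex (or flat) if neither vertex rises above the tangent plane
    # of its neighbor by more than the tolerance.
    return np.dot(d, n1) < tol and np.dot(-d, n2) < tol
```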

B. Semantic instance-aware segmentation refinement

To complement the unsupervised geometric segmentation of each depth frame with semantic object instance information, the corresponding RGB images are processed with the Mask R-CNN framework [2]. The network detects and classifies individual object instances and predicts a semantically annotated segmentation mask for each of them. Specifically, for each input RGB frame the output is a set of object instances, where the k-th detected instance is characterized by a binary mask M_k and an object category c_k. Figure 2 shows the sample output of Mask R-CNN.

The segmentation masks offer a straightforward way to associate each of the detected instances with one or more corresponding 3D depth segments s_i ∈ S_t. Pairwise 2D overlaps p_{i,k} between each r_i ∈ R_t and each predicted binary mask M_k are computed as the number of pixels in the intersection of r_i and M_k, normalized by the area of r_i:

p_{i,k} = |r_i ∩ M_k| / |r_i| .    (1)

For each region r_i ∈ R_t, the highest overlap percentage p_i and the index k_i of the corresponding mask M_k are found as:

p_i = max_k p_{i,k} ,    (2)

k_i = argmax_k p_{i,k} .    (3)

If p_i > τ_p, the corresponding 3D segment s_i is assigned the object instance label o_i = k_i and a semantic category c_i = c_{k_i}. Multiple segments in S_t assigned to the same object instance label o_i indicate an over-segmentation of non-convex, articulated shapes being refined through semantic instance information. The unique set of all object instance labels o_i assigned to segments s_i ∈ S_t in the current frame is denoted by O_t. All segments s_i ∈ S_t for which no mask M_k in the current frame exhibits enough overlap are assigned o_i = c_i = 0, denoting a geometric segment for which no semantic instance information could be predicted.

C. Data association

Because the frame-wise segmentation processes each incoming RGB-D image pair independently, it lacks any spatio-temporal information about corresponding segments and instances across the different frames. Specifically, this means that it does not provide an association between the set of predicted segments S_t and the set of segments S_{t+1}. Further, segments belonging to the same object instance might be assigned different o_i label values across two consecutive frames, since these represent mask indices valid only within the scope of the frame in which such masks were predicted.

A data association step is proposed here to track corresponding geometric segments and predicted object instances across frames. To this end, we define a set of persistent geometric labels L and a set of persistent object instance labels O which remain valid throughout the entire mapping session. In particular, each s_j from the set of segments S stored in the map is defined by a unique geometric label l_j ∈ L through a mapping L(s_j) = l_j. At each frame we then look for a mapping L_t(s_i) = l_j that matches predicted segments s_i ∈ S_t to corresponding segments s_j ∈ S. Similarly, within the scope of a frame we seek to define a mapping I_t(o_i) = o_m that matches object instances o_i ∈ O_t to persistent instance labels o_m ∈ O stored in the map.

To track spatial correspondences between segments s_i ∈ S_t identified in the current depth map and the set S of segments in the global map, it is only necessary to consider the set S_v ⊂ S of map segments visible in the current camera view. The pairwise 3D overlap Π_{i,j} is computed for each s_i ∈ S_t and each s_j ∈ S_v as the number of points in segment s_i that, when projected into the global map frame using the known camera pose, correspond to a voxel which belongs to segment s_j. For each segment s_j ∈ S_v, the highest overlap measure Π_j and the index i_j of the corresponding segment s_i ∈ S_t are found as

Π_j = max_i Π_{i,j} ,    (4)

i_j = argmax_i Π_{i,j} .    (5)

Each segment s_j ∈ S_v with Π_j > τ_π determines the persistent label mapping for the corresponding maximally overlapping segment s_{i_j} ∈ S_t from the current depth frame, i.e. L_t(s_{i_j}) = L(s_j). The τ_π threshold value is set to 20, and is used to prevent poorly overlapping global map segment labels from being propagated to the current frame. All segments s_i ∈ S_t that did not match to any segment s_j ∈ S_v are assigned a new persistent label l_new as L_t(s_i) = l_new. It is worth noting that, in contrast to previous work on segment tracking across frames [5], two or more segments from the current frame cannot be matched to the same segment in the global map. Without this constraint, information about a unified region in the map now being segmented in two or more parts would be lost, and thus initial wrong under-segmentations could not be fixed over time.
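The following sketch illustrates this matching under the stated one-to-one constraint, assuming the pairwise overlaps Π_{i,j} have already been computed; the tie-breaking when one frame segment maximizes several map segments is an assumption of the sketch:

```python
import numpy as np

TAU_PI = 20  # minimum 3D overlap in points, as stated in the text

def associate_segments(overlap, map_labels, next_label):
    """Propagate persistent labels from visible map segments to frame segments.

    overlap:    |S_t| x |S_v| integer matrix; overlap[i, j] is Pi_{i,j}
    map_labels: persistent labels L(s_j) of the visible map segments s_j
    next_label: first unused persistent label for newly discovered segments
    Returns the persistent label L_t(s_i) for every frame segment s_i.
    """
    n_frame, n_map = overlap.shape
    frame_labels = [None] * n_frame
    if n_frame > 0:
        # Each map segment votes only for its maximally overlapping frame
        # segment (Eqs. 4, 5), so two frame segments never inherit the label
        # of the same map segment; first sufficient match wins on conflicts.
        for j in range(n_map):
            i_best = int(np.argmax(overlap[:, j]))
            if overlap[i_best, j] > TAU_PI and frame_labels[i_best] is None:
                frame_labels[i_best] = map_labels[j]
    # Unmatched frame segments start a brand-new persistent segment.
    for i in range(n_frame):
        if frame_labels[i] is None:
            frame_labels[i] = next_label
            next_label += 1
    return frame_labels
```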

We introduce here the notation Φ(l_j, o_m) to denote the pairwise count in the global map between a persistent segment label l_j ∈ L and a persistent instance label o_m ∈ O. Φ(l_j, o_m) is used here to determine the mapping I_t(o_i) = o_m from instance labels o_i ∈ O_t to instance labels o_m ∈ O. Specifically, for each segment s_i ∈ S_t with a corresponding o_i ≠ 0 and no I_t(o_i) defined yet, the persistent object label o_m with the highest pairwise count Φ(L_t(s_i), o_m) > 0 is identified. The object label o_i is then mapped to o_m as I_t(o_i) = o_m. Remaining o_i with no mapping I_t(o_i) found are assigned a new persistent instance label o_new as I_t(o_i) = o_new. As before, we take special care not to discard valuable instance segmentation information by preventing multiple labels o_i ∈ O_t from mapping to the same persistent label o_m ∈ O.
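A sketch of this instance-label mapping, with the pairwise counts Φ represented as a plain dictionary; the data layout is illustrative only:

```python
def map_instance_labels(frame_segments, phi, next_instance):
    """Match frame-level instance labels o_i to persistent map labels o_m.

    frame_segments: (L_t(s_i), o_i) pairs for the segments s_i in S_t
    phi:            dict {(l_j, o_m): count}, the pairwise counts Phi
    next_instance:  first unused persistent instance label
    Returns the mapping I_t from frame labels o_i to persistent labels o_m.
    """
    mapping, used = {}, set()
    for l_j, o_i in frame_segments:
        if o_i == 0 or o_i in mapping:
            continue  # unlabeled segment, or instance already mapped
        # Pick the persistent instance most often co-observed with l_j,
        # skipping labels already claimed in this frame (one-to-one).
        candidates = [(count, o_m) for (l, o_m), count in phi.items()
                      if l == l_j and count > 0 and o_m not in used]
        if candidates:
            _, o_m = max(candidates)
        else:
            o_m, next_instance = next_instance, next_instance + 1
        mapping[o_i] = o_m
        used.add(o_m)
    return mapping
```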

The result of this data association step is a set of 3D segments s_i ∈ S_t from the current frame, each assigned a persistent segment label l_j = L_t(s_i). Further, the corresponding object instance label is matched to a persistent label o_m = I_t(o_i). Additionally, each segment s_i ∈ S_t is associated with the semantic object category c_i predicted by Mask R-CNN (Section III-B).

D. Map integration

The 3D segments discovered in the current frame, some of which are enriched with class and instance information, are fused into a global volumetric map. To this end, the Voxblox [11] TSDF-based dense mapping framework is extended to additionally encode object segmentation information. After projecting the segments into the global TSDF volume using the known camera pose, voxels corresponding to each projected 3D point are updated to store the incoming geometric segment label information, following the approach introduced in [4]. Additionally, for each s_i ∈ S_t integrated into the map at frame t with corresponding o_i ≠ 0, the pairwise count between l_j = L_t(s_i) and the object instance o_m = I_t(o_i) and the pairwise count between l_j and the class c_i are incremented as

Φ(l_j, o_m) = Φ(l_j, o_m) + 1 ,    (6)

Ψ(l_j, c_i) = Ψ(l_j, c_i) + 1 .    (7)

Each 3D segment s_j ∈ S in the global map volume is then defined by the set of voxels assigned to the persistent label l_j. If the segment represents a recognized, semantically annotated instance, then it is also associated with an object label o_j = argmax_{o_m} Φ(l_j, o_m) and a corresponding semantic class c_j = argmax_{c} Ψ(l_j, c).
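The count updates of Eqs. (6)–(7) and the final label resolution can be summarized in a short sketch, again with dictionaries standing in for the per-map counts:

```python
from collections import defaultdict

phi = defaultdict(int)  # pairwise counts Phi(l_j, o_m)
psi = defaultdict(int)  # pairwise counts Psi(l_j, c_i)

def integrate_counts(l_j, o_m, c_i):
    """One vote per integrated frame observation, Eqs. (6) and (7)."""
    phi[(l_j, o_m)] += 1
    psi[(l_j, c_i)] += 1

def resolve_segment(l_j):
    """Instance and class most frequently associated with map segment l_j."""
    inst = [(n, o_m) for (l, o_m), n in phi.items() if l == l_j]
    cls = [(n, c) for (l, c), n in psi.items() if l == l_j]
    o_j = max(inst)[1] if inst else 0  # argmax over Phi(l_j, .)
    c_j = max(cls)[1] if cls else 0    # argmax over Psi(l_j, .)
    return o_j, c_j
```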

IV. EXPERIMENTS

The proposed approach for incremental semantic object segmentation and scene reconstruction is evaluated on a Lenovo laptop with an Intel Xeon E3-1505M eight-core CPU at 3.00 GHz and an Nvidia Quadro M2200 GPU with 4 GB of memory, the latter used only for the Mask R-CNN component. The Mask R-CNN code is based on the publicly available implementation from Matterport,2 with the pre-trained weights provided for the Microsoft COCO dataset [15]. In all of the presented experimental setups, maps are built from RGB-D video with a resolution of 640x480 pixels.

To compare against previous work [8], we evaluate the 3D segmentation accuracy of the proposed dense object-level semantic mapping framework on real-world indoor scans from the SceneNN [10] dataset, and improve over the baseline for most of the evaluated scenes. We additionally report on the runtime performance of the individual components of the proposed system over the evaluated sequences.

The framework is further evaluated within an online mapping setting on a robotic platform across an entire office floor. Although the proposed object-level mapping framework operates at only 1 Hz, qualitative results in the form of a semantically annotated object-centric reconstruction validate the online nature of the approach and show its benefits in real-world, open-set conditions.

A. Instance-aware semantic segmentation

Several recent works explore the task of semantic instance segmentation of 3D scenes. The majority of these, however, take as input the full reconstructed scene, either processing it in chunks or directly as a whole. Because such methods are not constrained to progressively fusing predictions from partial observations into a global map but can learn from the entire 3D layout of the reconstructed scene, they are not directly comparable to the approach presented in this work.

2 https://github.com/matterport/Mask_RCNN


| Sequence ID | Bed | Chair | Sofa | Table | Books | Refrigerator | Television | Toilet | Bag | Average | Pham et al. [8] |
|-------------|-----|-------|------|-------|-------|--------------|------------|--------|-----|---------|-----------------|
| 011 | -    | 75.0 | 50.0 | 100  | -    | -    | -   | - | -    | 75.0 | 52.1 |
| 016 | 100  | 0.0  | 0.0  | -    | -    | -    | -   | - | -    | 33.3 | 34.2 |
| 030 | -    | 54.4 | 100  | 55.6 | 14.3 | -    | -   | - | -    | 56.1 | 56.8 |
| 061 | -    | -    | 91.7 | 33.3 | -    | -    | -   | - | -    | 62.5 | 59.1 |
| 078 | -    | 33.3 | -    | 0.0  | 47.6 | 100  | -   | - | -    | 45.2 | 34.9 |
| 086 | -    | 80.0 | -    | 0.0  | 0.0  | -    | -   | - | 0.0  | 20.0 | 35.0 |
| 096 | 0.0  | 87.5 | -    | 37.5 | 0.0  | -    | 0.0 | - | 50.0 | 29.2 | 26.5 |
| 206 | -    | 58.3 | 100  | 60.0 | -    | -    | -   | - | 100  | 79.6 | 41.7 |
| 223 | -    | 12.5 | -    | 75.0 | -    | -    | -   | - | -    | 43.8 | 40.9 |
| 255 | -    | -    | -    | -    | -    | 75.0 | -   | - | -    | 75.0 | 48.6 |

TABLE I: Comparison to the 3D semantic instance-segmentation approach from Pham et al. [8]. Per-class AP is evaluated using an IoU threshold of 0.5 for each of the 10 evaluated sequences from the SceneNN [10] dataset. The class-averaged mAP value is compared to the results presented in [8]. The proposed approach improves over the baseline for 7 of the 10 sequences evaluated; however, it is worth noting that the reported mAP values are evaluated on a smaller set of classes compared to the ones from [8].

Among the frameworks that instead explore online, incremental dense instance-aware semantic mapping, the work in [8] is, to the best of our knowledge, the only one to present quantitative results in terms of the achieved 3D segmentation accuracy. While a comparison to the results in [8] does not provide any insight into the performance of the proposed unsupervised object discovery strategy, it can help to assess the efficacy of the semantic instance-aware segmentation component of our system.

In their work, Pham et al. [8] report instance-based 3D segmentation accuracy results for the NYUDv2 40-class task, which includes commonly encountered indoor object classes, as well as structural, non-object categories, such as wall, window, door, floor, and ceiling. This set of classes is well-suited for semantic segmentation tasks in which the goal is to classify and label every single element, either voxel or surfel, of the 3D scene. Indeed, the approach in [8] initially employs a purely semantic segmentation strategy, and later clusters the semantically annotated 3D scene into individual instances. However, a set of classes which includes non-object categories does not apply to the object-based segmentation approach proposed in this work. Therefore, rather than training on a class set that does not meet the requirements and goals of the proposed framework, we relied on a Mask R-CNN model trained on the 80 Microsoft COCO object classes [15]. We then evaluated the segmentation accuracy on the 9 object categories in common between the NYUDv2 40-class and COCO tasks. Specifically, we picked the 9 categories that have an unambiguous one-to-one mapping between the two sets.

The proposed approach is evaluated on the 10 indoor sequences from the SceneNN [10] dataset for which [8] reports instance-based segmentation results. For each scene, the per-class Average Precision (AP) is computed using an Intersection over Union (IoU) threshold of 0.5 over the predicted 3D segmentation masks. As [8] only provides class-averaged mean Average Precision (mAP) values for each scene, these are compared with mAP averaged over the 9 evaluated categories. The results in Table I show that the proposed approach outperforms the method of [8] on 7 of the 10 evaluated sequences; however, it is worth noting again that the reported mAP values are computed over a smaller set of classes. Figure 4 additionally shows examples of the reconstructed object-level maps for the evaluated sequences.
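For reference, the sketch below illustrates the IoU-based matching underlying such AP numbers, with instances represented as sets of voxel indices; the exact evaluation protocol (e.g., the confidence ranking used to trace out the precision-recall curve) is not detailed in the paper and is simplified here to counting true positives:

```python
def instance_iou(pred_voxels, gt_voxels):
    """IoU between two instances given as sets of voxel indices."""
    union = len(pred_voxels | gt_voxels)
    return len(pred_voxels & gt_voxels) / union if union else 0.0

def count_true_positives(preds, gts, iou_thresh=0.5):
    """Greedily match predicted to ground-truth instances of one class.

    preds, gts: lists of voxel-index sets; each ground-truth instance can
    be matched at most once, mirroring the standard AP matching rule.
    """
    matched, tp = set(), 0
    for pred in preds:
        best_iou, best_gt = 0.0, None
        for idx, gt in enumerate(gts):
            if idx in matched:
                continue
            iou = instance_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_gt = iou, idx
        if best_iou >= iou_thresh:
            matched.add(best_gt)
            tp += 1
    return tp
```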

Lastly, the running times of the individual components of the framework, averaged over the 10 evaluated sequences, are shown in Table II. The numbers indicate that the presented system is capable of running at approximately 1 Hz on input of standard 640x480 resolution.

B. Online reconstruction and object mapping

The proposed system is evaluated in a real-life online mapping scenario. The robotic setup used for evaluation consists of a collaborative dual-arm ABB YuMi robot mounted on the omnidirectional Clearpath Ridgeback mobile base. The platform is equipped with two PrimeSense RGB-D cameras, respectively facing forwards and downwards at 45 degrees, and one visual-inertial sensor used only for localization. The complete setup is shown in Figure 3a.

Within the course of 5 minutes, the mobile base was manually steered along a trajectory through an entire office floor. Real-time poses of the robot were estimated through a combination of visual-inertial and wheel odometry and online feature-based localization in an existing map built and optimized with Maplab [27]. During scanning, the RGB-D stream of the two depth cameras is recorded to be later fed through our mapping framework at a frame rate of 1 Hz, emulating real-time on-board operation. That is, any frames that exceed the processing abilities of the system are discarded and not used to reconstruct the object-level map of the observed scene. The accompanying video illustrates the progressive output of the incremental reconstruction and segmentation on the recorded sequence.

Qualitative results for the final object-centric map reconstructed at 2 cm voxel resolution are shown in Figure 3. Although only a subset of the incoming RGB-D frames is integrated into the map volume, the resulting reconstruction of the environment densely describes the observed surface geometry.


Fig. 3: Figure (a) shows the robotic platform used for the online mapping experiment of an office floor. The map is reconstructed from RGB-D data recorded with two PrimeSense cameras mounted on an ABB YuMi robot attached to a Clearpath Ridgeback mobile base. The final map in Figure (d) is reconstructed at a voxel size of 2 cm. Figure (b) shows a detail of the map where individual objects identified in the scene are represented with different colors. The corresponding semantic categories of the recognized instances are shown in Figure (c) using the same color coding as in Figure 1.

The system is further able to detect recognized objects of known class, and to discover novel, previously unseen object-like elements in the scene. The final volumetric map additionally provides free space information, relevant for safe planning of robotic navigation and interaction tasks.

V. CONCLUSIONS

We presented a framework for online volumetric instance-aware semantic mapping from RGB-D data. By reasoning jointly over geometric and semantic cues, a frame-wise segmentation approach is able to infer high-level category information about detected and recognized elements, and to discover novel objects in the scene, for which no previous knowledge about their exact appearance is available. The partial segmentation information is incrementally fused into a global map, and the resulting object-level semantically annotated volumetric maps are expected to directly benefit both navigation and manipulation planning tasks.

| Component | Time (ms) |
|-----------|-----------|
| Mask R-CNN * | 407 |
| Depth segmentation * | 753 |
| Data association | 136 |
| Map integration | 276 |

TABLE II: Measured execution times of each stage of the proposed incremental object-level mapping framework, averaged over the 10 evaluated sequences from the SceneNN [10] dataset with RGB-D input of 640x480 resolution. Inference through Mask R-CNN runs on the GPU, while the remaining stages are implemented on the CPU. The map resolution is set here to 1 cm voxels. Note that the components marked with * can be processed in parallel.

Real-world experiments validate the online nature of the proposed incremental reconstruction and segmentation framework. However, to achieve real-time capabilities, the runtime performance of the individual components requires further optimization. A future research direction involves investigating the optimal way to fuse RGB and depth information within a unified per-frame object detection, discovery, and segmentation framework.


Fig. 4: Sample object-level reconstructions of the evaluated sequences from the SceneNN [10] dataset, incrementally built with the proposed mapping approach: (a) Sequence 078, (b) Sequence 011, (c) Sequence 086, (d) Sequence 206, (e) Sequence 030. The final maps encode information about the dense 3D geometry and location of the individual objects. The framework is able to discover novel, previously unseen 3D object-like elements in the scene and to provide a semantically-refined segmentation of the recognized instances. All scenes are reconstructed using a voxel size of 1 cm. Note that different colors in the map represent the different instances.

ACKNOWLEDGMENT

The authors would like to thank T. Aebi for his help in collecting data for the office floor mapping experiment.

REFERENCES

[1] C. C. Kemp, A. Edsinger, and E. Torres-Jara, “Challenges for Robot Manipulation in Human Environments [Grand Challenges of Robotics],” IEEE Robot. Automat. Mag., vol. 14, no. 1, pp. 20–29, March 2007.

[2] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 2980–2988.

[3] T. Pham, B. G. Vijay Kumar, T.-T. Do, G. Carneiro, and I. Reid, “Bayesian Semantic Instance Segmentation in Open Set World,” in Computer Vision – ECCV 2018. Springer International Publishing, 2018, pp. 3–18.

[4] F. Furrer, T. Novkovic, M. Fehr, A. Gawel, M. Grinvald, T. Sattler, R. Siegwart, and J. Nieto, “Incremental Object Database: Building 3D Models from Multiple Partial Observations,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018, pp. 6835–6842.

[5] K. Tateno, F. Tombari, and N. Navab, “Real-Time and Scalable Incremental Segmentation on Dense SLAM,” in 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept 2015, pp. 4465–4472.

[6] N. Sunderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, “Meaningful Maps With Object-Oriented Semantic Mapping,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sep. 2017, pp. 5079–5085.

[7] M. Runz, M. Buffier, and L. Agapito, “MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects,” in 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Oct 2018, pp. 10–20.

[8] Q. Pham, B. Hua, D. T. Nguyen, and S. Yeung, “Real-time Progressive 3D Semantic Segmentation for Indoor Scene,” arXiv:1804.00257, April 2018.

[9] J. McCormac, R. Clark, M. Bloesch, A. Davison, and S. Leutenegger, “Fusion++: Volumetric Object-Level SLAM,” in 2018 International Conference on 3D Vision (3DV), Sep. 2018, pp. 32–41.

[10] B. Hua, Q. Pham, D. T. Nguyen, M. Tran, L. Yu, and S. Yeung, “SceneNN: A Scene Meshes Dataset with aNNotations,” in 2016 Fourth International Conference on 3D Vision (3DV), Oct 2016, pp. 92–101.

[11] H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto, “Voxblox: Incremental 3D Euclidean Signed Distance Fields for On-Board MAV Planning,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sep. 2017, pp. 1366–1373.

[12] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single Shot MultiBox Detector,” in Computer Vision – ECCV 2016. Springer International Publishing, 2016, pp. 21–37.

[13] J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 6517–6525.

[14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in Neural Information Processing Systems (NIPS), 2015.

[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in Computer Vision – ECCV 2014. Springer International Publishing, 2014, pp. 740–755.

[16] R. Hu, P. Dollar, K. He, T. Darrell, and R. Girshick, “Learning to Segment Every Thing,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018, pp. 4233–4241.

[17] T. T. Pham, T. Do, N. Sunderhauf, and I. Reid, “SceneCut: Joint Geometric and Object Segmentation for Indoor Scenes,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 1–9.

[18] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, “SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), May 2017, pp. 4628–4635.

[19] Y. Nakajima, K. Tateno, F. Tombari, and H. Saito, “Fast and Accurate Semantic Mapping through Geometric-based Incremental Segmentation,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Oct 2018, pp. 385–392.

[20] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. J. Kelly, and A. J. Davison, “SLAM++: Simultaneous Localisation and Mapping at the Level of Objects,” in 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013, pp. 1352–1359.

[21] K. Tateno, F. Tombari, and N. Navab, “When 2.5D is not enough: Simultaneous Reconstruction, Segmentation and Recognition on dense SLAM,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), May 2016, pp. 2295–2302.

[22] Y. Nakajima and H. Saito, “Efficient Object-Oriented Semantic Mapping With Object Detector,” IEEE Access, vol. 7, pp. 3206–3213, 2019.

[23] V. A. Prisacariu, O. Kahler, S. Golodetz, M. Sapienza, T. Cavallari, P. H. Torr, and D. W. Murray, “InfiniTAM v3: A Framework for Large-Scale 3D Reconstruction with Loop Closure,” arXiv:1708.00783, Aug. 2017.

[24] M. Keller, D. Lefloch, M. Lambers, S. Izadi, T. Weyrich, and A. Kolb, “Real-Time 3D Reconstruction in Dynamic Scenes Using Point-Based Fusion,” in 2013 International Conference on 3D Vision (3DV), June 2013, pp. 1–8.

[25] K. M. Wurm, D. Hennes, D. Holz, R. B. Rusu, C. Stachniss, K. Konolige, and W. Burgard, “Hierarchies of Octrees for Efficient 3D Mapping,” in 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sep. 2011, pp. 4249–4255.

[26] E. Vespa, N. Nikolov, M. Grimm, L. Nardi, P. H. J. Kelly, and S. Leutenegger, “Efficient Octree-Based Volumetric SLAM Supporting Signed-Distance and Occupancy Mapping,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 1144–1151, April 2018.

[27] T. Schneider, M. Dymczyk, M. Fehr, K. Egger, S. Lynen, I. Gilitschenski, and R. Siegwart, “Maplab: An Open Framework for Research in Visual-Inertial Mapping and Localization,” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1418–1425, July 2018.

