    LIT: Light-field Inference of Transparency for Refractive Object Localization

    Zheming Zhou, Xiaotong Chen and Odest Chadwicke Jenkins

    Abstract— Translucency is prevalent in everyday scenes. As such, perception of transparent objects is essential for robots to perform manipulation. Compared with texture-rich or texture-less Lambertian objects, transparency induces significant uncertainty in object appearance. Ambiguity can be due to changes in lighting, viewpoint, and backgrounds, each of which brings challenges to existing object pose estimation algorithms. In this work, we propose LIT, a two-stage method for transparent object pose estimation using light-field sensing and photorealistic rendering. LIT employs multiple filters specific to light-field imagery in deep networks to capture transparent material properties, combined with robust depth and pose estimators based on generative sampling. Along with the LIT algorithm, we introduce the first light-field transparent object dataset for the tasks of recognition, localization, and pose estimation. Using the proposed algorithm on our dataset, we show that LIT outperforms both a state-of-the-art end-to-end pose estimation method and a generative pose estimator on transparent objects.

    I. INTRODUCTION

    Recognizing and localizing objects has a wide range of applications in robotics and remains a very challenging problem. The challenge comes from the variety of objects in the real world and the continuous, high-dimensional space of object poses. The diversity of object materials also induces strong uncertainty and noise in sensor observations. Existing works and datasets [1], [2], [3] cover a variety of texture-rich objects with distinguishable features between different types of objects. Several other works [4], [5] cover texture-less objects, but robot sensors can still perceive color and depth information from their Lambertian or specular surfaces. However, transparent objects are also prevalent in the real world, and many assumptions made for objects with opaque surface properties are ill-posed for transparent objects.

    The challenges carried by transparency are multidimensional. First, the non-Lambertian surface texture depends heavily on the environment lighting conditions and background appearance. For instance, transparent surfaces produce specularity from environmental lighting and project distorted background texture onto their surfaces due to refraction. Second, transparent objects' depth information cannot be correctly captured by RGB-D sensors, which are commonly used by current object recognition and localization methods. This limitation makes it difficult to collect transparent object pose data using current labeling tools (e.g., LabelFusion [6]). As a result, transparent object recognition and localization remains challenging for robotic perception.

    The authors are with the Department of Electrical Engineering and Computer Science, Robotics Institute, University of Michigan, Ann Arbor, MI, USA, 48109-2121 [zhezhou|cxt|ocj]@umich.edu

    Fig. 1: Demonstration of our LIT pipeline. (Top row) A Lytro Illum camera is mounted on a tripod and on a robot arm to capture transparent objects in challenging environments. (Bottom row) The final estimated poses are overlaid on the center view of the observed light-field image.

    Recently, several works [7], [8] established that light-field photography shows promising results in perceiving transparency. For example, Zhou et al. [9] generated grasp poses for transparent objects by classifying local patch features in a plenoptic descriptor called the Depth Likelihood Volume. However, capturing and labeling light-field images is time-consuming and computationally costly. Synthetic data is an alternative for image generation and has shown encouraging results in object recognition and localization. Georgakis et al. [10] rendered photorealistic images by projecting object texture models onto real backgrounds for training an object detector. Tremblay et al. [3] proposed DOPE, an end-to-end pose estimator using domain randomization and photorealistic rendering from the Unreal game engine [11]. We address the problem of transparency in the real world with photorealistic rendering and light-field perception.

    In this paper, we propose LIT as a transparent object 6D pose estimator. Within the LIT framework, we introduce 3D convolutional light-field filters in a neural network trained with purely synthetic data from our customized light-field rendering environments. We combine the network outputs with generative inference to achieve 6D pose estimation. We introduce the first light-field dataset for the tasks of transparent object recognition, segmentation, and pose estimation. The dataset contains 75000 synthetic light-field images and 300 real images from a Lytro Illum light-field camera labeled with segmentation and 6D poses. We demonstrate the efficacy of the proposed method with respect to a state-of-the-art end-to-end method and a generative method on our proposed transparent object dataset.

    Fig. 2: An overview of the LIT framework with its dataset. (a) The LIT-Pose dataset contains 75000 synthetic light-field images in the training set and 300 real images with 442 object instances in the testing set. (b) The LIT estimator is a two-stage pipeline. The first stage takes light-field images as input and outputs transparent material segmentation and object center point predictions. The segmentation results are passed through a detection network to obtain object labels. In the second stage, for each predicted center point, we predict the point depth likelihood by local depth estimation using the Depth Likelihood Volume. Particle optimization samples over center points and converges to the pose that best matches the segmentation results.

    II. RELATED WORK

    A. Pose Estimation for Robot Manipulation

    6D pose estimation has remained a central problem in robot perception for manipulation in recent years, and deep learning has become a powerful tool for accurate and fast inference in this field. Regarding end-to-end methods, Xiang et al. [12] propose PoseCNN, where the object's label, position in the image, depth, and 3D orientation are estimated in separate branches of the network. This line of research has also explored training on synthetic data [3], [13], pixel-wise voting schemes for keypoint regression with 2D-3D correspondence solvers such as PnP [14], [15], and residual networks that iteratively refine object poses [5], [2]. Hybrid methods, which use deep networks to hypothesize object locations or 6D poses and then apply probabilistic generative methods [1], [16], template matching [17], or point cloud registration methods such as Iterative Closest Point [4] or Congruent Sets [18] to obtain the final pose estimates, usually achieve better performance.

    Most deep-learning-based methods for pose estimation focus on texture-rich objects or those with texture-less but Lambertian surfaces [17], [4]. Transparent objects bring challenges in two main aspects: no reliable depth information, and no distinguishable environment-independent RGB textures. We take inspiration from previous works that might transfer to transparent object estimation: a decent intermediate detection or segmentation result plays an important role in restricting the search area for the 6D object pose, and a deep network trained on a large, elaborately designed synthetic dataset can reach performance similar to networks trained on real-world data.

    B. Light-field Perception for Transparency

    The foundation of light-field image rendering was first introduced by Levoy and Hanrahan [19] for the purpose of sampling new views from existing images. Built on this work, light-field cameras have shown advantages in performing visual tasks in challenging environments due to their ability to capture both light intensity and direction. Transparency is one such common challenging scene that researchers have explored. Maeno et al. [20] proposed the light-field distortion feature from epipolar images for recognizing transparent objects against background images. Recent work by Tsai et al. [21] further explores the light-field feature differences between transparent and Lambertian materials. Their results show that the distortion feature in epipolar images can distinguish materials with different refraction properties. Apart from refraction, specular reflection is the other perception challenge that transparent materials carry. Tao et al. [22] investigated line consistency in light-field images with a dichromatic reflection model to remove specularity from the image. Alperovich et al. [23] proposed a fully convolutional encoder-decoder to separate specularity from light-field images. In the robotics field, Zhou et al. [7], [9] created the DLV to model depth uncertainty in layered translucent environments and, based on this descriptor, infer object and grasp poses for robot manipulation. Our proposed work builds on the ideas described above and leverages the power of deep learning, photorealistic rendering, and generative inference.

    III. LIT ESTIMATOR

    The objective of object 6D pose estimation in a light-field image can be formalized as finding a rigid transformation (translation T and rotation R) in SE(3) from the object coordinate frame O to the camera coordinate frame C. Because of the 4D structure of light-field images, a plenoptic camera cannot be treated as a single coordinate frame. Instead, it is modeled as a composition of sub-apertures, or equivalently as a virtual camera array. We assume all sub-aperture cameras have an identical spatial resolution (hs, ws), and each sub-aperture camera has a relative location index on the angular resolution grid (ha, wa). Without loss of generality, we assume the light-field camera coordinate frame C coincides with the center-view camera coordinate frame Ccenter at the center of the (ha, wa) plane. Meanwhile, we assume the object 3D models and basic material types are available to our pipeline.

    A. LIT pipeline

    The two-stage LIT pipeline is shown in Fig. 2. The first stage consists of a two-stream neural network that outputs pixel-wise image segmentation and 2D object center point locations. This output is followed by a detection network that classifies object labels and clusters the center points. The second stage includes a light-field based object depth estimator that gives object center depth distributions, and a particle optimization process that converges to the final 6D poses.

    Several insights are incorporated in the pipeline design. First, the segmentation decoder branch of the first neural network performs transparent material segmentation rather than object-class or instance segmentation. This distinction means it only decides whether a pixel belongs to a transparent object, not which type of transparent object it belongs to. The object classification problem is settled in the subsequent detection network. The reason for this task decomposition is that pixel values within object areas depend highly on the background and material properties rather than on object types, so it is difficult for a single network to distinguish different objects from raw pixel values. In addition, the center point estimation branch does not regress multiple keypoints, which is common in texture-rich object pose estimation networks [14], [15]. The rationale is that transparent objects lack features that are independent of object pose and environmental changes such as background and lighting. In other words, the same point on the object may have various appearances. In our explorations, we found that the networks perform worse on end-to-end object-wise segmentation and fail to differentiate the 3D bounding-box keypoints other than the center point.

    B. Network Architecture

    As shown in Fig. 2, the input light-field image with angular resolution (ha, wa) is first decomposed into sub-aperture image stacks, which gives a 3D matrix of size hs × ws × (ha × wa) for each of the R, G, B channels. The stacks then pass through three light-field filters: the angular filter [24], the 3D sEPI filter, and the 3D tEPI filter.

    Fig. 3: Illustration of the three light-field filters. The angular filter (AF) has dimension 1 × 1 × (ha × wa) to capture features in angular pixels. The sEPI and tEPI filters have sizes n × n × wa and n × n × ha respectively, where n is the kernel size; the tEPI filter also uses a dilation of wa. All features are concatenated after passing through the filters.

    • Angular Filter. The angular filter aims to capture the reflection property of 3D surface points over the direction space of light rays. For instance, a non-Lambertian surface will exhibit different colors within a single angular patch, while the patch will be nearly identical for a Lambertian surface. The angular filter can be expressed as an operation over each pixel (x, y) in spatial space (for the jth filter):

    $g\left(\sum_{s,t} w^{j}_{i}(s, t)\, L_i(x, y, (s, t))\right)$    (1)

    where g(·) is the activation function, s and t are angular indices, w_i^j is the weight of the jth angular filter for color channel i ∈ {r, g, b}, and L_i(x, y, (s, t)) is the 4D light-field function for color channel i.

    • 3D EPI Filters. Transparent surfaces produce distortion features [20] because of refraction. In the epipolar image plane, refraction produces polynomial curve patterns that can be distinguished from the undistorted background texture. To capture these distortion features, we propose epipolar filters using 3D convolutional layers along the two angular dimensions s and t, respectively. The 3D EPI filters can be expressed as:

    $g\left(\sum_{u,v,s} \tilde{w}^{j}_{i}(u, v, s)\, L_i(x+u, y+v, (s, t))\right)$
    $g\left(\sum_{u,v,t} \hat{w}^{j}_{i}(u, v, t)\, L_i(x+u, y+v, (s, t))\right)$    (2)

    where (u, v) indexes the convolutional kernel in spatial space, w̃ and ŵ are the weights of the sEPI and tEPI filters, and we assume the input and output have the same spatial dimensions through proper padding. (A sketch of the three filters follows below.)
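    One way to read the three filters is as 3D convolutions over the sub-aperture stack. The snippet below is a minimal PyTorch sketch, not the authors' implementation: the flattening order of the (ha, wa) angular grid into one axis, the channel count, and the mean-pooling of the leftover angular axis before concatenation are assumptions made here for illustration; the paper only states that the filter outputs are concatenated.

```python
import torch
import torch.nn as nn

ha, wa = 5, 5        # angular resolution used in the paper
hs, ws = 64, 64      # spatial resolution (smaller than the paper's 224 x 224 to keep the sketch quick)
n = 3                # spatial kernel size of the EPI filters
out_ch = 64          # filters per branch, as in Sec. V

# Sub-aperture stack with the angular views flattened into one depth axis:
# shape (batch, color, ha*wa, hs, ws).
lf = torch.randn(1, 3, ha * wa, hs, ws)

# Angular filter: 1 x 1 x (ha*wa) kernel, collapsing the angular axis (Eq. 1).
angular = nn.Conv3d(3, out_ch, kernel_size=(ha * wa, 1, 1))

# sEPI filter: n x n x wa kernel sliding over wa consecutive views (one angular row).
sepi = nn.Conv3d(3, out_ch, kernel_size=(wa, n, n), padding=(0, n // 2, n // 2))

# tEPI filter: n x n x ha kernel with dilation wa along the angular axis,
# so it touches one view per angular row (one angular column).
tepi = nn.Conv3d(3, out_ch, kernel_size=(ha, n, n), dilation=(wa, 1, 1),
                 padding=(0, n // 2, n // 2))

fa, fs, ft = angular(lf), sepi(lf), tepi(lf)   # leftover angular depths: 1, 21, 5 respectively

# Assumed fusion: average the leftover angular axis, then concatenate along channels
# before the encoder-decoder.
feat = torch.cat([f.mean(dim=2) for f in (fa, fs, ft)], dim=1)
print(feat.shape)   # torch.Size([1, 192, 64, 64])
```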

    After passing through the three customized filters, the embedded features of the light-field image are concatenated and fed into an encoder-decoder structure with two branches for image segmentation and object center point regression. The output of the segmentation branch is a pixel-wise segmentation of the center-view image, in which each pixel is predicted to be transparent material, background, or boundary. The output of the center point branch is the 2D pixel offset from each pixel to its estimated center position on the image, together with a pixel-wise confidence value.

    The segmentation loss Lseg is defined as the cross-entropy loss normalized by class pixel probabilities [25]. The center point regression loss mainly follows the design in [14], except that we only regress the center point positions. The learning goal for each pixel p inside the segmentation area M is to regress the offset hp from its location cp to the object center gp on the 2D image. The loss Lpos is thus expressed as:

    $L_{pos} = \sum_{p \in M} \lVert g_p - (c_p + h_p) \rVert_1$    (3)

    where ‖·‖1 denotes the L1 norm. Each pixel's estimate is associated with a confidence value wp, and the confidence loss Lconf is defined as:

    $L_{conf} = \sum_{p \in M} \left\lVert w_p - \exp\left(-\tau \lVert g_p - (c_p + h_p) \rVert_2\right) \right\rVert_1$    (4)

    where τ is a modulating factor and ‖·‖2 denotes the L2 norm. The overall loss L is calculated as:

    $L = \alpha L_{seg} + \beta (L_{pos} + \gamma L_{conf})$    (5)

    where α, β, and γ modulate the importance of segmentation, regression, and regression confidence, respectively.
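    As a minimal sketch of how Equations (3)-(5) combine, the function below computes the first-stage loss from placeholder network outputs. It is not the authors' code: the tensor layouts, the use of per-pixel means instead of sums over M, plain cross entropy in place of the class-normalized variant [25], and the default values of τ, α, β, γ are assumptions.

```python
import torch
import torch.nn.functional as F

def lit_loss(seg_logits, seg_labels, offsets, conf, centers, pix_coords, mask,
             tau=1.0, alpha=1.0, beta=1.0, gamma=1.0):
    """
    seg_logits : (B, 3, H, W) logits for {background, transparent, boundary}
    seg_labels : (B, H, W) integer class labels
    offsets    : (B, 2, H, W) predicted offsets h_p from each pixel to its object center
    conf       : (B, H, W) predicted per-pixel confidence w_p
    centers    : (B, 2, H, W) ground-truth object center g_p broadcast to every pixel
    pix_coords : (B, 2, H, W) pixel coordinates c_p
    mask       : (B, H, W) boolean mask of pixels inside labeled object regions (the set M)
    """
    # Segmentation term (plain cross entropy here; the paper normalizes by class pixel probabilities).
    l_seg = F.cross_entropy(seg_logits, seg_labels)

    # Residual g_p - (c_p + h_p), restricted to pixels in M.
    resid = (centers - (pix_coords + offsets)).permute(0, 2, 3, 1)[mask]   # (N, 2)

    l_pos = resid.abs().sum(dim=1).mean()                # Eq. (3), averaged over |M| here
    target_conf = torch.exp(-tau * resid.norm(dim=1))    # exp(-tau * L2 error)
    l_conf = (conf[mask] - target_conf).abs().mean()     # Eq. (4)

    return alpha * l_seg + beta * (l_pos + gamma * l_conf)   # Eq. (5)
```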

    An object detection network is appended to differentiate object types based on geometric shape from the segmentation results. Specifically, the network takes the output of the segmentation decoder branch as input and produces bounding boxes with object labels. The detected bounding boxes also serve to cluster object center points. The overall output of the first stage is a set of bounding boxes, each with an object label and a set of object center points, which serves as the initial distribution of object center locations for the next stage.

    C. Particle Optimization

    The second stage of the pipeline estimates the 6D poses of transparent objects via a sampling-based iterative likelihood reweighting process [26]. Object pose samples are initialized based on the center point locations from the first stage. During the iterations, rendered samples are projected onto the 2D image, and their likelihoods are calculated as the similarity between the projected rendered samples and the segmentation results.

    1) Depth Estimation of Center Points: Instead of directly regressing the depth of center points to initialize the particles, we deploy a plenoptic descriptor called the depth likelihood volume (DLV) [7]. The DLV describes the depth of a single pixel as a likelihood function rather than a deterministic value.

    The advantage of using the DLV is that the depth likelihood can be naturally incorporated into the generative inference framework at the sample initialization step. The likelihood D(xc, yc, d) of a given center point located at (xc, yc) in the center-view image plane Ic can be calculated as:

    $D(x_c, y_c, d) = \frac{1}{N} \sum_{a \in A \setminus I_c} T_{a,d}(x_c, y_c)$    (6)

    where A is the set of sub-aperture views, Ta,d(xc, yc) is the function that calculates the color intensity and gradient cost of pixel (xc, yc) at a specific depth d, and 1/N is a normalization term that maps cost to likelihood. Detailed implementations can be found in [7], [9].
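    The sketch below illustrates the structure of Equation (6) for a single center-view pixel: hypothesize a depth, reproject the pixel into every other sub-aperture view, accumulate a matching cost, and map the cost to a likelihood. It is only a rough illustration: the cost here is a plain color difference, whereas the actual Ta,d in [7], [9] also uses gradients, and the disparity model (shift proportional to baseline over depth) and exponential cost-to-likelihood mapping are assumptions of this sketch.

```python
import numpy as np

def depth_likelihood(views, center_idx, xc, yc, depths, baselines, focal):
    """
    views      : (ha*wa, H, W, 3) sub-aperture images
    center_idx : index of the center view in `views`
    (xc, yc)   : pixel of interest in the center view
    depths     : sequence of candidate depths d
    baselines  : (ha*wa, 2) per-view (dx, dy) offsets from the center camera
    focal      : focal length in pixels
    """
    center = views[center_idx]
    likelihood = np.zeros(len(depths))
    for k, d in enumerate(depths):
        cost, n = 0.0, 0
        for a in range(len(views)):
            if a == center_idx:
                continue
            # At depth d, pixel (xc, yc) appears shifted by disparity = focal * baseline / d.
            u = int(round(xc + focal * baselines[a, 0] / d))
            v = int(round(yc + focal * baselines[a, 1] / d))
            if 0 <= u < center.shape[1] and 0 <= v < center.shape[0]:
                cost += np.abs(views[a][v, u] - center[yc, xc]).sum()
                n += 1
        # Low matching cost across views means high depth likelihood.
        likelihood[k] = np.exp(-cost / max(n, 1))
    return likelihood / likelihood.sum()
```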

    2) Sample Initialization: Each sample is a hypothesis of an object 6D pose. Its 3D location can be derived from the 2D image coordinates (u, v), the depth d, and the camera parameters. In this way, the probability distribution of 3D center point locations is formed by combining the center point candidates with the depth likelihood volume results:

    $u = c_x + f_x \frac{x}{z}, \quad v = c_y + f_y \frac{y}{z}, \quad d = z$
    $p(X = x, Y = y, Z = z) = w_c(u, v)\, D(u, v, d)$    (7)

    where wc are the object center point confidence values, fx, fy, cx, cy are the camera intrinsic parameters, and D is the likelihood from the DLV in Equation (6). We perform importance sampling over this distribution to initialize the pose sample locations. The orientations of the samples are randomly initialized.
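    A minimal sketch of this initialization, under assumed data layouts, is shown below: joint weights wc(u, v)·D(u, v, d) are built over (center, depth) pairs, sample indices are drawn from them, and each draw is back-projected through the pinhole model with a random orientation attached. The function and variable names are illustrative, not from the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def init_samples(centers, conf, depth_candidates, depth_lik, K, n_samples=200):
    """
    centers          : (M, 2) candidate center pixels (u, v) from the first stage
    conf             : (M,) confidences w_c(u, v)
    depth_candidates : (M, D) candidate depths per center
    depth_lik        : (M, D) DLV likelihoods D(u, v, d)
    K                : 3x3 intrinsics [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    joint = conf[:, None] * depth_lik          # w_c(u, v) * D(u, v, d), as in Eq. (7)
    joint = joint / joint.sum()
    flat = np.random.choice(joint.size, size=n_samples, p=joint.ravel())
    ci, di = np.unravel_index(flat, joint.shape)

    samples = []
    for c, d in zip(ci, di):
        u, v = centers[c]
        z = depth_candidates[c, d]
        x, y = (u - cx) * z / fx, (v - cy) * z / fy   # invert u = cx + fx*x/z, v = cy + fy*y/z
        samples.append((np.array([x, y, z]), Rotation.random()))  # random initial orientation
    return samples
```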

    3) Likelihood Function: The probability of each sample during the iterations is calculated using a likelihood function, represented as the similarity between the projected rendered object point cloud and the segmentation results from the neural network. Specifically, the object points in the local object frame are transformed by the sample pose and then projected onto the 2D image plane. The likelihood function is composed of intersection-over-union scores between the projected rendered point cloud and the segmentation masks for transparent material and its boundary:

    $\mathit{weight} = \eta \frac{|S_{pcd} \cap S_{seg}|}{|S_{pcd} \cup S_{seg}|} + (1 - \eta) \frac{|\partial S_{pcd} \cap \partial S_{seg}|}{|\partial S_{pcd} \cup \partial S_{seg}|}$    (8)

    where Spcd is the silhouette of the projected rendered point cloud, Sseg is the set of pixels segmented as transparent material, ∂Spcd and ∂Sseg are the sets of boundary pixels of Spcd and Sseg respectively, and η modulates the importance of the boundaries.
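    With both masks available as binary images, Equation (8) reduces to two IoU computations. The sketch below is one possible implementation; extracting the boundary by morphological erosion is an implementation choice of this sketch, not something specified in the paper.

```python
import numpy as np
from scipy.ndimage import binary_erosion

def boundary(mask):
    # One-pixel-wide boundary: the mask minus its erosion.
    return mask & ~binary_erosion(mask)

def iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def sample_weight(proj_silhouette, seg_mask, eta=0.7):
    """proj_silhouette, seg_mask: boolean (H, W) masks; eta trades off region vs. boundary IoU."""
    return (eta * iou(proj_silhouette, seg_mask)
            + (1.0 - eta) * iou(boundary(proj_silhouette), boundary(seg_mask)))
```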

    4) Update Process: We follow the procedure of iterative likelihood reweighting to produce pose estimates. The initialized samples are assigned equal weights. Then a cycle of calculating likelihood values, resampling based on the weights, and sample diffusion is repeated in every iteration. During diffusion, each pose sample is randomly perturbed in SE(3), subject to independent zero-mean Gaussian noise in translation and rotation. The algorithm terminates when the maximum sample weight reaches a threshold or the iteration count reaches its limit.
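    The loop below is a hedged sketch of this update process. The scoring callback stands in for projecting a pose hypothesis and evaluating Equation (8); the noise scales, the weight threshold, and the choice of axis-angle perturbations for rotation diffusion are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def refine(samples, render_and_weight, max_iters=100, weight_thresh=0.9,
           trans_sigma=0.005, rot_sigma_deg=2.0):
    """samples: list of (translation (3,), scipy Rotation); render_and_weight(t, r) -> Eq. (8) score."""
    for _ in range(max_iters):
        weights = np.array([render_and_weight(t, r) for t, r in samples])
        if weights.max() >= weight_thresh:          # terminate on a confident sample
            break
        probs = weights / weights.sum()
        idx = np.random.choice(len(samples), size=len(samples), p=probs)   # resample by weight
        new_samples = []
        for i in idx:
            t, r = samples[i]
            t = np.asarray(t) + np.random.normal(0.0, trans_sigma, 3)       # diffuse translation
            dr = Rotation.from_rotvec(np.random.normal(0.0, np.deg2rad(rot_sigma_deg), 3))
            new_samples.append((t, dr * r))                                  # diffuse rotation
        samples = new_samples
    return max(samples, key=lambda s: render_and_weight(*s))                # best remaining pose
```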

    Fig. 4: (a) Training set: example synthetic light-field images rendered in three different environments. (b) Testing set: example test images with different backgrounds and pose configurations. (c) Result: visualization for an example test image, overlaying the estimated poses on the original center-view image.

    IV. LIGHT-FIELD DATASET

    We propose a dataset of light-field images for the tasks of transparent object recognition, segmentation, and 6D pose estimation. The dataset is gathered in different household environments with different viewpoints and lighting conditions. It includes 5 types of objects, {wine cup, tall cup, glass jar, champagne cup, starbucks bottle}, with different geometric shapes. The images are captured using a Lytro Illum camera with different camera settings. For each setting, we calibrate the camera using the toolbox described in [27]. The spatial resolution of the calibrated images is 383 × 552, and the angular resolution is 14 × 14. Since the Lytro camera has a very small baseline between adjacent sub-aperture images, we extract 5 × 5 angular pixel windows with stride 1 from the calibrated images for both the dataset and our algorithm. The dataset contains a total of 75000 training images and 300 real-world images with 442 object instances, each labeled with pixel-wise semantic segmentation and 6D object poses. Fig. 4 shows examples of synthetic training data, real-world test data, and estimation results using LIT. The poses are labeled by re-projecting the objects directly into the center-view image and matching them with the observations.
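    One plausible reading of the 5 × 5 extraction step is a sliding window over the 14 × 14 angular grid, sketched below. The (ha, wa, H, W, 3) array layout for the decoded light field and the sliding-window interpretation of "stride 1" are assumptions of this sketch.

```python
import numpy as np

def angular_windows(lf, size=5, stride=1):
    """lf: (14, 14, H, W, 3) decoded sub-aperture views; yields (size, size, H, W, 3) blocks."""
    ha, wa = lf.shape[:2]
    for s in range(0, ha - size + 1, stride):
        for t in range(0, wa - size + 1, stride):
            yield lf[s:s + size, t:t + size]

# Tiny placeholder light field (real calibrated views are 383 x 552 each).
lf = np.zeros((14, 14, 48, 69, 3), dtype=np.float32)
print(sum(1 for _ in angular_windows(lf)))   # 100 candidate 5 x 5 angular windows
```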

    The captured real data are treated as the testing set for the LIT algorithm. For training the two-stream network of the LIT pipeline, we use rendered light-field images, which are also included in the dataset along with the generation tools.

    The light-field rendering pipeline is built on the NDDS [11] synthetic data generation plugin for Unreal Engine 4 (UE4). The created virtual light-field capturer has an angular resolution of 5 × 5 and a spatial resolution of 224 × 224. The baseline between adjacent virtual cameras is 0.1 cm. We generate data in three UE4 world environments: room, temple, and forest. In each environment, we highly randomize the lighting conditions, including color, direction, and intensity. The target objects are rendered using the translucent material category with different material parameter settings. Objects move in two ways in the environment: flying in the air with random translation and rotation, or falling freely with collision and gravity enabled. When the objects move, the virtual light-field capturer tracks and looks at them from arbitrary azimuths and elevations.

    V. EXPERIMENTS

    Input light-field images have a spatial resolution of 224 × 224 and an angular resolution of 5 × 5. We use 64 filters of each type: angular, 3D sEPI, and 3D tEPI. The encoder-decoder uses the VGG-16 [28] structure as its backbone architecture and is initialized with a model pre-trained on ImageNet [29]. The segmentation branch outputs a pixel-wise class from the three classes {background, transparent, boundary}. The detection network is a Faster R-CNN network [30] with a VGG-16 backbone; its input is the binary mask of the transparent class, and its output is bounding boxes with object labels.

    A. Evaluation of light-field filters on image segmentation

    Segmentation is taken as the optimization target in our second stage, so it is critical to the LIT pipeline. We first compare against two baseline methods to show the advantage of using light-field images and the three light-field-specific filters. One baseline takes only the 2D center-view image as input (the same network structure as LIT but without the light-field filters); the other is an ablation with only the angular filter. All three networks are trained on the synthetic dataset containing 75000 images. Table I shows the accuracy results, where LIT achieves higher scores than the baseline methods on all metrics. LIT outperforms the baseline with single RGB input, which indicates that the light-field image's capacity to capture the direction of light helps transparent material segmentation. LIT also achieves higher accuracy than the baseline with only the angular filter, showing that both angular features and EPI features contribute to recognizing transparent objects.

    Method     gAcc    mAcc    mIoU    wIoU    mBFS
    2D         0.871   0.500   0.228   0.397   0.140
    AF only    0.917   0.501   0.318   0.582   0.197
    LIT        0.954   0.520   0.455   0.854   0.390

    TABLE I: Comparison of LIT and baseline methods on transparent material segmentation. Performance is quantified through global accuracy (gAcc), mean class accuracy (mAcc), mean Intersection over Union (mIoU), weighted IoU (wIoU), and mean BF (Boundary F1) contour matching score (mBFS). Detailed definitions are given in [31]. 'AF only' refers to the baseline method with only the angular filter.

    B. Evaluation of pose estimation

    We compare the 6D pose estimation results of LIT against a state-of-the-art end-to-end deep learning method, DOPE [3], and a generative light-field based transparent object pose estimation method, PMCL [7]. Since DOPE also uses pure synthetic data for training and has already outperformed PoseCNN [12], which itself outperforms other single-shot pose estimation networks, the comparison between LIT and DOPE demonstrates our capability on transparent object pose estimation.

    Fig. 5: Comparison of 6D pose estimation results with respect to the ADD-S metric and the area under the accuracy-threshold curve.

    For a fair comparison with DOPE, we also make it compatible with light-field inputs: we add the three light-field filters from Section III before the first encoder layer of the DOPE network. Both LIT and DOPE are trained with 75000 synthetic images for the 5 objects. PMCL requires object labels and 3D workspaces for generative inference, so we initialize PMCL with ground truth object labels and workspaces with a volume of 40 × 40 × 40 cm3 around the ground truth object locations. We use the ADD-S metric [12] (sketched below) to evaluate the poses of symmetric objects. We then show the accuracy curves in Fig. 5 with a distance threshold of 0.1 m. The Area Under the accuracy-threshold Curve (AUC) values and the per-object time cost of each algorithm are shown in Table II.
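    For reference, ADD-S for a single pose estimate can be computed as below: the average distance from each model point under the estimated pose to its nearest model point under the ground-truth pose, following [12]. The KD-tree nearest-neighbor search is simply an implementation convenience of this sketch.

```python
import numpy as np
from scipy.spatial import cKDTree

def add_s(model_pts, R_est, t_est, R_gt, t_gt):
    """model_pts: (N, 3) object model points; R_*: 3x3 rotations; t_*: (3,) translations."""
    est = model_pts @ R_est.T + t_est          # model under the estimated pose
    gt = model_pts @ R_gt.T + t_gt             # model under the ground-truth pose
    dists, _ = cKDTree(gt).query(est)          # closest ground-truth point for each estimated point
    return dists.mean()
```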

    From the result plots, we find that LIT performs much better than DOPE and a bit better than PMCL. For DOPE, we believe that directly regressing the eight 3D bounding box vertices and their relations is not an optimal strategy for transparent objects. First, DOPE's object recognition is embedded in the network, but a transparent object's texture is not informative enough to distinguish different objects. Second, the eight vertices of the 3D bounding boxes are ambiguous for the network to learn because of object symmetry and the lack of distinguishable features. For PMCL, since we provide it with ground truth labels and workspaces, it performs comparatively well on the test set. However, PMCL uses a single-view DLV as the matching target, which includes noise from specularity and distortion from transparent surfaces. Furthermore, DLV construction is computationally costly and can take 300 seconds to complete. Our LIT pipeline instead uses the neural network to produce segmentation as a lightweight matching target for generative inference, and center points for particle initialization, which proves to be a better strategy for dealing with transparency.

    AUC     wc     tc     gj     cc     sb     all    time(s)/obj
    DOPE    0.14   0.16   0.21   0.16   0.00   0.18   < 1
    PMCL    0.24   0.32   0.46   0.28   0.34   0.32   300
    LIT     0.38   0.32   0.62   0.35   0.44   0.45   < 10

    TABLE II: Comparison of LIT, DOPE, and PMCL on transparent object pose estimation. Here wc, tc, gj, cc, and sb refer to wine cup, tall cup, glass jar, champagne cup, and starbucks bottle. All numbers except those in the last column are area under the accuracy-threshold curve values.

    VI. CONCLUSIONS

    We introduce LIT, a two-stage pose estimator for transparent objects using light-field perception. LIT employs the learning power of deep networks to distinguish transparent objects across light-field sub-aperture images. We show that a network trained only on synthetic data can give good segmentation of transparent materials, which serves as a prior for the second-stage pose estimation. We also show the effectiveness of decomposing the 6D pose estimation problem into sub-modules, 2D detection, depth prediction, and 3D orientation estimation, through comparison with a state-of-the-art end-to-end deep network. Along with the method, we propose the first light-field transparent object dataset, including synthetic and real data, for the tasks of object recognition, segmentation, and 6D pose estimation. Finally, although our method is aimed at objects with transparent and refractive materials, it can also be applied to other household objects with different surface material properties. Future work built on LIT can extend to more complex scene understanding for robot manipulation.

    REFERENCES

    [1] Zhiqiang Sui, Zheming Zhou, Zhen Zeng, and Odest Chadwicke Jenkins. Sum: Sequential scene understanding and manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3281–3288. IEEE, 2017.

    [2] Chen Wang, Danfei Xu, Yuke Zhu, Roberto Martín-Martín, Cewu Lu, Li Fei-Fei, and Silvio Savarese. Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3343–3352, 2019.

    [3] Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790, 2018.

    [4] Martin Sundermeyer, Zoltan-Csaba Marton, Maximilian Durner, Manuel Brucker, and Rudolph Triebel. Implicit 3d orientation learning for 6d object detection from rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), pages 699–715, 2018.

    [5] Yi Li, Gu Wang, Xiangyang Ji, Yu Xiang, and Dieter Fox. Deepim: Deep iterative matching for 6d pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 683–698, 2018.

    [6] Pat Marion, Peter R Florence, Lucas Manuelli, and Russ Tedrake. Label fusion: A pipeline for generating ground truth labels for real rgbd data of cluttered scenes. In IEEE International Conference on Robotics and Automation (ICRA), pages 1–8. IEEE, 2018.

    [7] Zheming Zhou, Zhiqiang Sui, and Odest Chadwicke Jenkins. Plenoptic monte carlo object localization for robot grasping under layered translucency. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1–8. IEEE, 2018.

    [8] John Oberlin and Stefanie Tellex. Time-lapse light field photography for perceiving transparent and reflective objects. 2017.

    [9] Zheming Zhou, Tianyang Pan, Shiyu Wu, Haonan Chang, and Odest Chadwicke Jenkins. Glassloc: Plenoptic grasp pose detection in transparent clutter. arXiv preprint arXiv:1909.04269, 2019.

    [10] Georgios Georgakis, Arsalan Mousavian, Alexander C Berg, and Jana Kosecka. Synthesizing training data for object detection in indoor scenes. arXiv preprint arXiv:1702.07836, 2017.

    [11] Thang To, Jonathan Tremblay, Duncan McKay, Yukie Yamaguchi, Kirby Leung, Adrian Balanon, Jia Cheng, William Hodge, and Stan Birchfield. NDDS: NVIDIA deep learning dataset synthesizer, 2018. https://github.com/NVIDIA/Dataset_Synthesizer.

    [12] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199, 2017.

    [13] Josip Josifovski, Matthias Kerzel, Christoph Pregizer, Lukas Posniak, and Stefan Wermter. Object detection and pose estimation based on convolutional neural networks trained with synthetic data. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6269–6276. IEEE, 2018.

    [14] Yinlin Hu, Joachim Hugonot, Pascal Fua, and Mathieu Salzmann. Segmentation-driven 6d object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3385–3394, 2019.

    [15] Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. Pvnet: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4561–4570, 2019.

    [16] Xiaotong Chen, Rui Chen, Zhiqiang Sui, Zhefan Ye, Yanqi Liu, R Bahar, and Odest Chadwicke Jenkins. Grip: Generative robust inference and perception for semantic robot manipulation in adversarial environments. arXiv preprint arXiv:1903.08352, 2019.

    [17] Kiru Park, Timothy Patten, Johann Prankl, and Markus Vincze. Multi-task template matching for object detection, segmentation and pose estimation using depth images. In 2019 International Conference on Robotics and Automation (ICRA), pages 7207–7213. IEEE, 2019.

    [18] Chaitanya Mitash, Abdeslam Boularias, and Kostas Bekris. Robust 6D object pose estimation with stochastic congruent sets. arXiv preprint arXiv:1805.06324, 2018.

    [19] Marc Levoy and Pat Hanrahan. Light field rendering. In Proceedings of the 23rd annual conference on Computer graphics and interactive techniques, pages 31–42. ACM, 1996.

    [20] Kazuki Maeno, Hajime Nagahara, Atsushi Shimada, and Rin-ichiro Taniguchi. Light field distortion feature for transparent object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2786–2793, 2013.

    [21] Dorian Tsai, Donald G Dansereau, Thierry Peynot, and Peter Corke. Distinguishing refracted features using light field cameras with application to structure from motion. IEEE Robotics and Automation Letters, 4(2):177–184, 2018.

    [22] Michael W Tao, Jong-Chyi Su, Ting-Chun Wang, Jitendra Malik, and Ravi Ramamoorthi. Depth estimation and specular removal for glossy surfaces using point and line consistency with light-field cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6):1155–1169, 2015.

    [23] Anna Alperovich, Ole Johannsen, Michael Strecke, and Bastian Goldluecke. Light field intrinsics with a deep encoder-decoder network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9145–9154, 2018.

    [24] Ting-Chun Wang, Jun-Yan Zhu, Ebi Hiroaki, Manmohan Chandraker, Alexei A Efros, and Ravi Ramamoorthi. A 4d light-field dataset and cnn architectures for material recognition. In European Conference on Computer Vision, pages 121–138. Springer, 2016.

    [25] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.

    [26] Stephen J McKenna and Hammadi Nait-Charif. Tracking human motion using auxiliary particle filters and iterated likelihood weighting. Image and Vision Computing, 25(6):852–862, 2007.

    [27] Yunsu Bok, Hae-Gon Jeon, and In So Kweon. Geometric calibration of micro-lens-based light field cameras using line features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(2):287–300, 2017.

    [28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

    [29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR09, 2009.

    [30] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.

    [31] Gabriela Csurka, Diane Larlus, and Florent Perronnin. What is a good evaluation measure for semantic segmentation? In Proceedings of the British Machine Vision Conference, pages 32.1–32.11, 2013.
