Probabilistic Articulated Real-Time Tracking for Robot Manipulation

Cristina Garcia Cifuentes1, Jan Issac1, Manuel Wüthrich1, Stefan Schaal1,2 and Jeannette Bohg1

Abstract— We propose a probabilistic filtering method which fuses joint measurements with depth images to yield a precise, real-time estimate of the end-effector pose in the camera frame. This avoids the need for frame transformations when using it in combination with visual object tracking methods.

Precision is achieved by modeling and correcting biases in the joint measurements as well as inaccuracies in the robot model, such as poor extrinsic camera calibration. We make our method computationally efficient through a principled combination of Kalman filtering of the joint measurements and asynchronous depth-image updates based on the Coordinate Particle Filter.

We quantitatively evaluate our approach on a dataset recorded from a real robotic platform, annotated with ground truth from a motion capture system. We show that our method is robust and accurate even under challenging conditions such as fast motion, significant and long-term occlusions, and time-varying biases. We release the dataset along with open-source code of our method to allow quantitative comparison with alternative approaches.

I. INTRODUCTION

Autonomous grasping and manipulation remain a frontier in robotics research, especially in complex scenarios which are characterized by unstructured, dynamic environments. Under such conditions, it is impossible to accurately predict all consequences of an action far into the future. Therefore, open-loop execution of offline-planned manipulation actions is very likely to fail in such scenarios.

A key ingredient for accurate manipulation is to continuously and precisely estimate the configuration of the robot's manipulator and the target objects. This configuration can be expressed as the 6-degree-of-freedom (DoF) poses of all objects of interest in a common frame of reference. The focus of this paper is on the estimation of the end-effector pose with respect to the camera frame. We refer to this problem as robot tracking. The most common approach to estimating the end-effector pose is to apply forward kinematics using joint angle measurements. However, errors in the joint measurements are common, due to sensor drift or bias, and complex mechanical effects like cable stretch. Inaccuracies in the kinematic model are quite common as well, because the locations of different parts of the robot with respect to each other might not be perfectly known. Even slight errors in either of these two may lead to large errors in the end-effector pose. On the other hand, depth cameras can be used to accurately estimate the end-effector

*This research was supported in part by National Science Foundation grants IIS-1205249, IIS-1017134, EECS-0926052, the Office of Naval Research, the Okawa Foundation, and the Max-Planck-Society.

1 Autonomous Motion Department at the Max Planck Institute for Intelligent Systems, Tübingen, Germany. Email: [email protected]

2 Computational Learning and Motor Control lab at the University of Southern California, Los Angeles, CA, USA.

(a) Apollo (b) ARM

(c) Sequence from Apollo with large simulated bias and strong occlusions

(d) Sequence from ARM with moderate real biases

Fig. 1. Our method produces real-time, accurate end-effector poses in the camera frame, despite joint encoder biases and/or poor extrinsic calibration of the camera to the robot. (a, b) The two robotic platforms used for data recording and evaluation. (c, d) For both robots, our estimate (in orange) aligns well with the visual information. This enables precise interaction with objects tracked in the same camera frame. In white: the naive forward kinematics, for comparison. Best viewed in color.

pose, but images are typically received at a lower rate, often with significant delay. Additionally, the computation required to obtain an estimate from a depth image is typically large, which further increases the delay until the estimate is available.

In order to achieve precise, real-time robot tracking, we present a method that combines joint measurements with depth images from a camera mounted on the robot, so as to get the advantages from both: accurate, up-to-date estimates at a high rate. We formulate the problem in the framework of recursive Bayesian estimation, enabling principled, online fusion of the two sources of information. We account for the


main sources of error along the kinematic chain by explicitly modeling and estimating the biases of the joint measurements, and a time-varying 6-DoF transform describing a correction of the extrinsic calibration of the camera with respect to the robot. The algorithm we derive is computationally efficient and well-suited for real-time implementation. Our code is publicly available (https://github.com/bayesian-object-tracking) and can be used off-the-shelf on any robot given its kinematic model.

For experimental validation, we collect a dataset from two humanoid robotic platforms. It covers a range of challenging conditions, including fast arm and head motion as well as large, long-term occlusions. We also modify the original data to simulate large joint biases. We demonstrate the robustness and accuracy of our system quantitatively and qualitatively compared to multiple baselines. We make this dataset and evaluation code public to allow evaluation and comparison of this and other methods.

To summarize, our contributions are: (i) the description of a probabilistic model and a computationally efficient algorithm (Sections III and IV); (ii) a practically useful, real-time implementation that we make publicly available; (iii) experimental validation of its robustness and accuracy (Sections V and VI); and (iv) the release of a new dataset for quantitative evaluation.

II. RELATED WORK

Inaccurate and uncertain hand-eye coordination is very common in robotic systems. This problem has therefore been considered frequently in the robotics community.

One approach is to calibrate the transformation between hand and eye offline, prior to performing the manipulation task. Commonly, the robot is required to hold and move specific calibration objects in front of its eyes [1], [2], [3]. This can however be tedious and time-consuming, and the calibrated parameters may degrade over time so that the process has to be repeated. Instead, our approach is to continuously track the arm during a manipulation task, which is robust against drifting calibration parameters or against online effects, such as cable stretch due to increased load or contact with the environment.

One possible solution is the use of fiducial markers on specific parts of the robot [4]. Markers have the advantage of being easy to detect, but the disadvantage of limiting the arm configurations so as to keep the markers always in view, and of requiring precise knowledge of their position relative to the robot's kinematic chain. As an alternative to markers, different 2D visual cues can be used to track the entire arm, at the cost of higher computational demand, e.g. texture, gradients or silhouettes, which are also often used in general object tracking [5], [6], [7], [8]. Instead, we choose a depth camera as visual sensor, which readily provides geometric information while being less dependent on illumination effects.

In this paper, we are particularly interested in the question of how to leverage multi-modal sensory data, e.g. proprioception, as available in a robotic system. In the following, we review work from the class of marker-less, model-based, multi-modal tracking methods that estimate the configuration of articulated objects online and in real time, assuming access to joint encoder readings.

Many works formulate this problem as the minimization of an objective function for each new incoming frame, and typically use the solution from the previous time step as initialization.

Klingensmith et al. [9] present a simple but computationally efficient articulated Iterative Closest Point (ICP) [10] method to estimate the joint encoder bias of their robot arm. The objective function is defined in terms of distance between randomly sampled 3D points on the robot model and the point cloud from a depth camera. This is minimized by exploiting the pseudo-inverse of the kinematic Jacobian for computing the gradient.

Pauwels et al. [11] consider the problem of simultaneously estimating the pose of multiple objects and the robot arm while it is interacting with them, all relative to a freely moving RGB-D camera. The objective is defined according to the agreement between a number of visual cues (depth, motion) when comparing observations with rendered images given the state. Instead of estimating the robot joint configuration, the authors consider the arm as a rigid object given the encoder values, which are assumed to be precise. To cope with remaining error in the model, robot base and end-effector are considered single objects with the kinematics being enforced through soft constraints.

There is a related family of methods which also optimize for point estimates, but additionally consider a model of the temporal evolution of the state, and combine the two within filtering-based algorithms.

Krainin et al. [12] propose a method for in-hand modeling of objects, which requires an estimate of the robot arm pose to be able to segment the hand from the object. They perform articulated ICP similar to [9], and use the result as a measurement of joint angles in a Kalman filter.

Hebert et al. [13] estimate the pose of the object relative to the end-effector, and the offset between the end-effector pose according to forward kinematics and visual data. They consider a multitude of cues from stereo and RGB data, such as markers on the hand, the silhouette of object and arm, and 3D point clouds. They employ articulated ICP similar to [9] or 2D signed distance transforms similar to [7] for optimization. An Unscented Kalman Filter fuses the results.

Schmidt et al. [14] propose a robot arm tracking method based on an Extended Kalman Filter. They estimate the bias of the joint encoders similar to [9]. In addition to agreement with depth images, they include physics-based constraints in their optimization function, to penalize interpenetration between object and robot as well as disagreement between contact sensors and the estimate.

In contrast to the filtering-based methods described above, our approach is to model the acquisition of the actual, physical measurements, rather than the uncertainty associated with an optimization result. Probability theory then provides a well-understood framework for fusing the different information sources. This leads to a method with few parameters


which are intuitive to choose, because they are closely related to the physics of the sensors.

The method we propose extends our previous work on visual tracking based on a model of raw depth images [15], [16]. These measurements are highly nonlinear and non-Gaussian, for which particle filters are well suited.

In [15], we propose a method to track the 6-DoF pose of a rigid object given its shape. Our image model explicitly takes into account occlusions due to any other objects, which are common in real-world manipulation scenarios. The structure of our formulation makes it suitable to use a Rao-Blackwellized particle filter [17], which in combination with the factorization of the image model over pixels yields a computationally efficient algorithm. This model can be extended to articulated objects given knowledge of the object's kinematics and shape. The additional difficulty is that joint configurations can be quite high-dimensional (> 30). In [16], we propose a method which alleviates this issue by sampling dimension-wise, and show an application to robot arm tracking.

In this paper we take the latter method further by fusing the visual data with the measurements from the joint encoders, therefore exploiting their complementary nature. In addition to estimating joint angles and/or biases like [16], [9], [14], [13], [12], we simultaneously estimate the true camera pose relative to the kinematic chain of the robot. Further, differently from most approaches mentioned, we process all joint measurements as they arrive rather than limiting ourselves to the image rate, and we handle the delay in the images. A crucial difference to related work is that we model not only self-occlusion of the arm, but also occlusions due to external objects. Although we take as input fewer cues than other methods [11], [14], [13], we already achieve remarkable robustness and accuracy in robot arm tracking. The dataset we propose in this paper will enable quantitative comparison to alternative approaches.

III. MODELING

Our goal in robot tracking is to recursively estimate the current joint angles $a_t$ of the robot given the history of depth images $z_t$ and joint angle measurements $q_t$. The kinematics of the robot and the shape of its limbs are assumed to be known, so the joint angles are enough to describe the full configuration of the robot, and in particular the pose of the end-effector. We further define a number of auxiliary latent variables, which allow us to explain the mismatch between the readings from the joint encoders and the robot configuration observed in the depth images:

– We augment the set of joints in the kinematic model with six extra virtual joints. These do not correspond to physical links, but represent a translation and a rotation between the nominal camera pose (i.e. as specified in the kinematic model) and the true camera pose. The nominal camera pose can be measured in advance, at least roughly, if not provided by the manufacturer.

– At each joint $j$, there is a bias $b^j_t$ that perturbs the joint measurement $q^j_t$.

Fig. 2. Bayes network for our model (Section III). Shaded nodes are observed (joint measurements $q$ and depth images $z$). White nodes are latent (angles $a$, biases $b$ and occlusions $o$). $t$, $j$ and $i$ are indices for discrete time, joints and pixels. Our inference algorithm (Section IV) combines Kalman filtering of joint measurements and image updates based on the CPF [16].

Fig. 3. Distribution of measured depth at one pixel for different occlusion probabilities $p(o^i_t = 1)$, from lower (dotted blue) to higher (solid green).

– For each pixel $i$ of the depth image $z_t$, a binary variable $o^i_t$ indicates whether an external occluder is present.

The dependencies among variables are shown in the Bayes network in Fig. 2. In the remainder of this section, we define the process and measurement distributions that connect them, so as to fully specify our model. In Section IV we provide an algorithm for inference.
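To make the state layout concrete, here is a minimal sketch of the latent variables, assuming a simple per-joint layout; the class and field names are hypothetical and not taken from the released code.

```python
import numpy as np

class TrackerState:
    """Illustrative container for the latent variables of Section III
    (hypothetical names): per-joint angles a_t and biases b_t, plus six
    virtual joints describing the correction of the nominal camera pose."""
    def __init__(self, n_joints):
        self.angles = np.zeros(n_joints)        # a_t for the physical joints
        self.biases = np.zeros(n_joints)        # b_t, one bias per joint
        self.camera_offset = np.zeros(6)        # virtual joints: 3 translation + 3 rotation
```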

A. Camera model

In this paper, we use the same model for depth images as in [15], which factorizes over pixels

$p(z_t \mid a_t, o_t) = \prod_i p(z^i_t \mid a_t, o^i_t).$   (1)

We consider three sources of error between the rendered and measured depths: (i) inaccuracies in the mesh model, which we model as Gaussian noise; (ii) external occlusions, which cause the measured depth to be lower than the distance to the target object; and (iii) noise in the depth sensor, which we model as a mixture of a Gaussian distribution around the distance to the closest object and a uniform distribution in the range of the sensor. See [15] for the detailed expressions.

Fig. 3 shows an example of the resulting distribution of measured depth at a pixel after marginalizing out the occlusion variable

$p(z^i_t \mid a_t) = \sum_{o^i_t \in \{0,1\}} p(z^i_t \mid a_t, o^i_t)\, p(o^i_t),$   (2)

Page 4: Probabilistic Articulated Real-Time Tracking for Robot ...is.tuebingen.mpg.de/uploads_file/attachment/... · Probabilistic Articulated Real-Time Tracking for Robot Manipulation ...

for different occlusion probabilities $p(o^i_t = 1)$. It has a peak around the rendered depth $d^i(a_t)$, and a thick tail at lower depths, which accounts for the possibility of an occlusion. The peak becomes less pronounced at higher occlusion probability.

A difference between the present model (Fig. 2) and the model used in [15] is that here the occlusion probabilities are estimated at each step, but not propagated over time. This is a simplification we make to ensure tractability of the more involved filtering problem in the present paper. As will be shown in the experimental section, the resulting algorithm is nevertheless very robust to occlusion.
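As an illustration of the shape of this per-pixel distribution, the sketch below evaluates a simplified stand-in for the model of [15]: a Gaussian around the rendered depth, an occlusion component with mass at closer depths, and a small uniform outlier term. The functional forms and constants are assumptions for illustration, not the exact expressions of [15].

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def pixel_depth_likelihood(z, d_rendered, p_occ,
                           sigma=0.01, z_min=0.4, z_max=8.0, w_outlier=0.01):
    """Simplified per-pixel likelihood with the occlusion variable marginalized out:
    a qualitative stand-in for the model of [15], not its exact expressions."""
    visible = gauss_pdf(z, d_rendered, sigma)                 # peak at the rendered depth
    occluded = (1.0 / max(d_rendered - z_min, 1e-6)
                if z_min <= z < d_rendered else 0.0)          # occluder somewhere in front
    outlier = 1.0 / (z_max - z_min)                           # uniform sensor noise
    mixture = (1.0 - p_occ) * visible + p_occ * occluded
    return (1.0 - w_outlier) * mixture + w_outlier * outlier
```

Plotting this over $z$ for increasing occlusion probability reproduces the qualitative behavior of Fig. 3: a peak at the rendered depth that becomes less pronounced as the occlusion probability grows.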

B. Joint encoder model

We model the joint measurement as the sum of angle, bias, and independent Gaussian noise for each joint $j$:

$p(q^j_t \mid a^j_t, b^j_t) = \mathcal{N}(q^j_t \mid a^j_t + b^j_t,\ \sigma^2_q).$   (3)

C. Angle process model

The angle of each joint j follows a random walk

$p(a^j_{t+1} \mid a^j_t) = \mathcal{N}(a^j_{t+1} \mid a^j_t,\ \Delta\sigma^2_a)$   (4)

where $\Delta$ is the length of the time step.¹ The parameter $\sigma_a$ needs to be large enough to capture fast angle dynamics.

This simple model works well in our experiments, likely because of the frequent measurements (every 1–3 ms) with little noise. An interesting question for future work is whether the performance could be improved by using a more complex model, taking into account rigid body dynamics and the control commands sent to the robot.

D. Bias process model

We model the bias such that its variance does not grow indefinitely in the absence of measurements. A simple linear model that achieves this is the following random walk

$p(b^j_{t+1} \mid b^j_t) = \mathcal{N}(b^j_{t+1} \mid c^{\Delta} b^j_t,\ \Delta\sigma^2_b), \quad c < 1,$   (5)

where $\sigma_b$ denotes the noise standard deviation, $c$ is a parameter specifying how fast the process tends to 0, and $\Delta$ is the time step length.¹

To check that this process would behave as desired, and to gain some intuition on how to choose the parameters, it is helpful to look at its asymptotic behavior in the absence of measurements. We can obtain the asymptotic distribution by taking the distributions at two consecutive time steps

$p(b^j_t) = \mathcal{N}(b^j_t \mid \mu_*, \sigma^2_*)$   (6)

$p(b^j_{t+1}) = \mathcal{N}(b^j_{t+1} \mid c^{\Delta}\mu_*,\ \Delta\sigma^2_b + c^{2\Delta}\sigma^2_*)$   (7)

and equating their means and variances, which yields

$\mu_* = 0; \qquad \sigma^2_* = \frac{\Delta}{1 - c^{2\Delta}}\,\sigma^2_b$   (8)

for any $\sigma_b > 0$ and $c < 1$. We can see that in the absence of measurements the mean of the bias converges to zero (which is reasonable when there is no data suggesting otherwise), and the variance converges to some constant which depends on the choice of $c$ and $\sigma_b$. Equation (8) can help us find meaningful values for these parameters.

¹ The dependence on $\Delta$ arises from the integration over time of a continuous process with white noise.
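As a quick numerical check of Eq. (8), and as a way to pick $c$ and $\sigma_b$, one can evaluate the stationary standard deviation directly; the parameter values below are made up for illustration only.

```python
import numpy as np

def stationary_bias_std(sigma_b, c, dt):
    """Stationary std of the bias random walk (5) without measurements,
    from Eq. (8): sigma_*^2 = dt * sigma_b^2 / (1 - c**(2*dt))."""
    return np.sqrt(dt * sigma_b**2 / (1.0 - c**(2.0 * dt)))

dt = 0.001                        # s, one joint measurement every 1 ms
sigma_b = np.deg2rad(0.5)         # hypothetical bias noise level
c = 0.9                           # decay towards zero per unit time
print(np.rad2deg(stationary_bias_std(sigma_b, c, dt)))   # roughly 1 degree asymptotic std
```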

IV. ALGORITHM

Having defined our process and measurement models, our goal now is to find an algorithm for recursive inference. That is, we want to obtain the current belief $p(a_t, b_t \mid z_{1:t}, q_{1:t})$ from the latest measurements $z_t$ and $q_t$, and the previous belief $p(a_{t-1}, b_{t-1} \mid z_{1:t-1}, q_{1:t-1})$.

Joint and depth measurements are typically generated at different rates in real systems, which makes it useful to be able to separately incorporate a joint measurement or a depth measurement into our belief at any point in time. We assume that we receive joint measurements at a high rate, so we choose the time interval between joint measurements to be the basic time step at which we want to produce estimates. Further, we have to cope with a delay in the depth images, as explained in Section IV-C.

A. Filtering joint measurements

Defining $h_t = \{z_{1:t-1}, q_{1:t-1}\}$ for brevity, the incorporation of a joint measurement can be written as

$p(a_t, b_t \mid q_t, h_t) \propto p(q_t \mid a_t, b_t) \int_{a_{t-1},\, b_{t-1}} p(a_t \mid a_{t-1})\, p(b_t \mid b_{t-1})\, p(a_{t-1}, b_{t-1} \mid h_t)$   (9)

where we used our assumption that the angle process (4) and the bias process (5) are independent. The models involved in (9) are (3), (4) and (5). All of them are linear Gaussian, hence (9) has a closed-form solution, which corresponds to one time step of a Kalman Filter (KF) [18]. Furthermore, all these models factorize in the joints. Hence, if the initial belief $p(a_{t-1}, b_{t-1} \mid h_t)$ factorizes too, we can filter with an independent KF for each joint, which greatly improves efficiency. We will see in the following how we keep the belief Gaussian and factorized in the joints at all times, so that this is indeed the case.
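Since each joint carries a two-dimensional state [angle, bias] with linear Gaussian models (3)-(5), one KF step per joint amounts to a few small matrix operations. The following is a minimal sketch under these assumptions; the variable names are mine, not from the released code.

```python
import numpy as np

def kf_joint_step(mean, cov, q, dt, sigma_a, sigma_b, sigma_q, c):
    """One prediction + encoder update (9) for a single joint, state x = [a, b]."""
    # Predict with the process models (4)-(5): a stays, b decays by c**dt.
    F = np.array([[1.0, 0.0],
                  [0.0, c**dt]])
    Q = np.diag([dt * sigma_a**2, dt * sigma_b**2])
    mean = F @ mean
    cov = F @ cov @ F.T + Q
    # Update with the encoder model (3): q = a + b + noise.
    H = np.array([[1.0, 1.0]])
    S = H @ cov @ H.T + sigma_q**2          # innovation variance
    K = cov @ H.T / S                       # Kalman gain, shape (2, 1)
    mean = mean + (K * (q - H @ mean)).ravel()
    cov = cov - K @ H @ cov
    return mean, cov
```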

B. Updating with depth images

Let $\bar{h}_t = \{q_t, h_t\}$ be the history of measurements before the image update. Each time a depth image is received, we update the belief obtained in (9), i.e. $p(a_t, b_t \mid \bar{h}_t)$, to get the desired posterior $p(a_t, b_t \mid z_t, \bar{h}_t)$.

The goal of this section is to write this update in such a form that we can apply our previous work [16], so as to handle the high-dimensional state efficiently.

We begin by noting the following equalities:

$p(a_t, b_t \mid z_t, \bar{h}_t) = \dfrac{p(z_t \mid a_t, b_t, \bar{h}_t)\, p(a_t, b_t \mid \bar{h}_t)}{p(z_t \mid \bar{h}_t)}$   (10)

$= \dfrac{p(z_t \mid a_t, \bar{h}_t)\, p(a_t \mid \bar{h}_t)\, p(b_t \mid a_t, \bar{h}_t)}{p(z_t \mid \bar{h}_t)}$   (11)

$= p(a_t \mid z_t, \bar{h}_t)\, p(b_t \mid a_t, \bar{h}_t),$   (12)

Page 5: Probabilistic Articulated Real-Time Tracking for Robot ...is.tuebingen.mpg.de/uploads_file/attachment/... · Probabilistic Articulated Real-Time Tracking for Robot Manipulation ...

where we used that our image model does not depend on the bias. According to (12), the desired posterior can be decomposed into two terms. The first is the posterior in the angle only, which we will derive in the following. The second is easily obtained from the prior (9) by conditioning on $a_t$, which for a Gaussian can be done in closed form [19, p. 90].
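For the 2x2 per-joint case, this conditioning uses the standard Gaussian identities; a minimal sketch:

```python
import numpy as np

def condition_bias_on_angle(mean, cov, a_value):
    """p(b_t | a_t) from a joint Gaussian over x = [a, b], using the
    standard conditioning formulas (cf. [19, p. 90])."""
    mu_a, mu_b = mean
    s_aa, s_ab, s_ba, s_bb = cov[0, 0], cov[0, 1], cov[1, 0], cov[1, 1]
    cond_mean = mu_b + s_ba / s_aa * (a_value - mu_a)
    cond_var = s_bb - s_ba * s_ab / s_aa
    return cond_mean, cond_var
```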

1) Posterior in the angle: We can write the posterior in the angle as

$p(a_t \mid z_t, \bar{h}_t) \propto p(z_t \mid a_t)\, p(a_t \mid \bar{h}_t),$   (13)

where the first term is the image observation model (1), and the second is readily obtained from the Gaussian prior (9) by marginalizing out $b_t$ [19, p. 90].

Since this update only involves the image $z_t$ and the joint angles $a_t$, we can solve it in the same manner as we did in [16]. This involves sampling from $p(a_t \mid \bar{h}_t)$ dimension-wise, and weighting with the likelihood $p(z_t \mid a_t)$. For more details, we refer the reader to [16], [15] – what is important here is that this step creates a set of particles $\{{}^{(l)}a_t\}_l$ distributed according to (13).
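The sketch below illustrates only the sample-and-weight structure of this step, using plain importance sampling from the Gaussian prior; the actual method samples dimension-wise with the Coordinate Particle Filter [16], which scales much better with the number of joints.

```python
import numpy as np

def image_update_particles(prior_mean, prior_cov, log_likelihood,
                           n_particles=200, rng=None):
    """Simplified image update (13): sample angle particles from p(a_t | h_t),
    weight by the depth-image likelihood p(z_t | a_t), and resample.
    Not the CPF of [16]; a plain importance-sampling stand-in."""
    rng = np.random.default_rng() if rng is None else rng
    particles = rng.multivariate_normal(prior_mean, prior_cov, size=n_particles)
    log_w = np.array([log_likelihood(a) for a in particles])    # log p(z_t | a_t)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    idx = rng.choice(n_particles, size=n_particles, p=w)        # resampling
    return particles[idx]
```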

2) Gaussian approximation: After incorporating a depth image, we approximate the particle belief (13) with a Gaussian distribution factorizing in the joints

$p(a_t \mid z_t, \bar{h}_t) \approx \prod_j \mathcal{N}(a^j_t \mid \mu^j_t, \Sigma^j_t)$   (14)

Moment matching is well known to be the minimum KL-divergence solution for Gaussian approximations [19, pp. 505–506]. Therefore, we set

$\mu^j_t = \frac{1}{L}\sum_{l=1}^{L} {}^{(l)}a^j_t, \qquad \Sigma^j_t = \frac{1}{L}\sum_{l=1}^{L}\left({}^{(l)}a^j_t - \mu^j_t\right)^2.$   (15)
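In code, the moment matching of Eq. (15) is a per-joint sample mean and variance over the particle set, assumed here to be an array of shape (L, number of joints):

```python
import numpy as np

def moment_match_per_joint(particles):
    """Gaussian approximation (14)-(15): per-joint mean and variance of the particles."""
    mu = particles.mean(axis=0)                    # mu_t^j
    var = ((particles - mu) ** 2).mean(axis=0)     # Sigma_t^j
    return mu, var
```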

3) Full posterior: Since we approximated both terms in (12) by Gaussians which factorize in the joints, the full posterior is of the same form

$p(a_t, b_t \mid z_t, \bar{h}_t) = \prod_j \mathcal{N}(a^j_t, b^j_t \mid m^j_t, M^j_t).$   (16)

Hence, we can continue to filter joint measurements using independent KFs for each joint, as mentioned in Section IV-A. The parameters $m^j_t$ and $M^j_t$ are easily obtained using standard Gaussian manipulations, e.g. 'completing the square' in the exponent [19, p. 86].

Approximating distributions by factorized distributions is a common practice in machine learning to ensure tractability, in particular in variational inference. In our case, this factorization allows us to have $n$ KFs on a 2-dimensional state each (with complexity $O(n)$), instead of one KF on a $2n$-dimensional state (with complexity $O(n^3)$).

C. Taking into account image delay

An additional difficulty is that images are very data heavy, which often leads to delays in acquisition and transmission. This means that at time $t$ we might receive an image with time stamp $t' < t$. However, all the joint angles between $t'$ and $t$ have been processed already, and our current belief is $p(a_t, b_t \mid q_t, h_t)$. Our solution to this problem is to maintain a buffer of beliefs and joint measurements. When an image is received, we find the belief $p(a_{t'}, b_{t'} \mid q_{t'}, h_{t'})$ with the appropriate time stamp, and incorporate the image $z_{t'}$ into this belief according to Section IV-B. Once this is done, we re-filter the joint measurements in the buffer which have a time stamp $> t'$ according to Section IV-A to obtain the current belief. Therefore, the filtering of joint angles has to be extremely fast, which is possible due to the factorization in the joints.
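A sketch of this buffering scheme, assuming callables for the two kinds of updates; the structure is hypothetical and not taken from the released implementation.

```python
from collections import deque

class BeliefBuffer:
    """Delay handling as in Section IV-C: keep recent (stamp, q, belief) triples,
    rewind to the belief matching a delayed image, then re-filter newer joints."""
    def __init__(self, maxlen=2000):
        self.entries = deque(maxlen=maxlen)

    def push(self, stamp, q, belief):
        self.entries.append((stamp, q, belief))

    def apply_delayed_image(self, t_img, image_update, joint_update):
        # Assumes at least one buffered belief is older than the image stamp.
        past = [e for e in self.entries if e[0] <= t_img]
        future = [e for e in self.entries if e[0] > t_img]
        belief = image_update(past[-1][2])      # incorporate z_{t'} (Section IV-B)
        for _, q, _ in future:                  # fast re-run of the KF updates (IV-A)
            belief = joint_update(belief, q)
        return belief
```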

D. Efficiency and implementation

We implemented our tracker as two filters executed in parallel at different rates, which can be reinitialized from each other's belief at any time as described above; a simplified sketch of this organization follows the list below. The components which make the proposed method real-time capable are:

– the factorization in the pixels of our depth measurement model [15];

– the use of the Coordinate Particle Filter [16], which allows us to filter in the high-dimensional joint space of the robot with relatively few particles;

– the factorization of the belief in the joints as described in Section IV-B, which allows us to apply independent Kalman filters for each joint; and

– our GPU implementation for computing the depth image likelihoods for all particles in parallel [20].

Our code can be used off-the-shelf by providing the robot's model in Unified Robot Description Format. Other convenient features are the optional automatic injection of the virtual joints into the robot model, and that it is easily configurable to account for a constant offset in the image time stamp.
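As referenced above, here is a simplified, single-threaded sketch of how the two update paths could be interleaved; the message format and function names are assumptions, and the released code runs the two filters in parallel rather than in one loop.

```python
def run_tracker(message_stream, joint_filter, image_filter, buffer):
    """Toy event loop: fast per-joint KF updates for every encoder reading,
    slower (possibly delayed) depth-image updates via the belief buffer."""
    belief = None
    for msg in message_stream:                      # assumed iterable of dicts
        if msg["type"] == "joints":
            belief = joint_filter(belief, msg["q"])
            buffer.push(msg["stamp"], msg["q"], belief)
        elif msg["type"] == "depth_image":
            belief = buffer.apply_delayed_image(
                msg["stamp"],
                lambda b: image_filter(b, msg["image"]),
                joint_filter)
    return belief
```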

V. EXPERIMENTAL SETUP

To evaluate our approach towards robust and accurate robot tracking, we recorded data from two different robotic platforms. This section describes these platforms, the type of data, baselines and evaluation measures. We make this dataset public, together with the robot models and evaluation code, so as to allow comparison among methods and reproduction of the results here.

A. Robotic platforms

We recorded data on two different robotic platforms. They are both fixed-base dual-arm manipulation platforms equipped with two three-fingered Barrett Hands and an RGB-D camera (Asus Xtion) mounted on an active head. They differ in the source of error that leads to inaccurate hand-eye coordination.

Apollo (Fig. 1a) is equipped with two 7-DoF KUKA LWR IV arms with very accurate joint encoders. The active humanoid head on which the RGB-D camera is mounted has a mechanism consisting of two four-bar linkages that are connected at the head and generate a coupled motion. We use an approximation to easily compute the 3 rotary DoF of the neck from linear joint encoders. This causes a non-linear, configuration-dependent error in the pose of the head-mounted camera.


The ARM robot (Fig. 1b) consists of two 7-DoF Barrett WAM arms that are actuated by a cable-driven mechanism with motors and encoders in the shoulder. Upon contact with the environment or when lifting a heavy object, the resulting cable stretch is not observable through the encoders and therefore leads to a time-varying bias on the joint angles. The active head consists of two stacked pan-tilt units with simple kinematics and accurate encoders.

While the dominant error for Apollo comes from the inaccurate kinematic model from the base to the camera, hand-eye coordination on the ARM robot is mostly perturbed by biases in the joint encoder readings. With this kind of data, we cover the most important errors that cause inaccurate hand-eye coordination in many robotics systems.

B. Data from Apollo

The data recorded on Apollo consists of time-stamped RGB-D images (although our method uses only depth), and measurements from the joint encoders. We also measured the pose of the head and end-effector using a VICON motion-capture system. After a calibration procedure to align the camera frame to the VICON system, this provides ground-truth poses of the end-effector with respect to the camera.

The dataset has seven 60-second-long sequences recorded while Apollo was in gravity compensation, and where its right arm and/or head is moved by a person. The sequences include fast motion (labeled with the '++' suffix) and simultaneous arm and head motion ('both'). One of the sequences contains heavy, long-term occlusions of the moving arm ('occlusion'). Another contains a long period in which the arm is out of the camera's field of view ('in/out/in'). In five additional 30-second-long sequences, Apollo performs sinusoidal motion (prefix 's-') of the lower, upper and full arm, head motion, and simultaneous head and arm motion.

As mentioned in Section V-A, the kinematics of the head are quite inaccurate, while they are precise in the arm. Therefore, we extend this dataset by simulating bias in the joint sensors of the arm. We do this by adding an offset to the measured joint angles, while leaving the RGB-D images and ground-truth poses intact. One version of the simulated dataset has a constant offset of 8.6 degrees in each arm joint. Another has a time-varying offset on each joint, consisting of smooth steps of alternating sign and 5 degrees of amplitude.

C. Data from ARM

On the ARM robot, the joint measurements are contaminated with real noise and bias, but we do not have ground-truth labels. We have several sequences of depth images and joint measurements of the robot moving its two arms in sinusoidal waves at different frequencies. We use this data for qualitative evaluation.

D. Performance measures

Our method produces estimates of the joint angles, and a correction of the camera pose with respect to the robot. In particular, this provides the pose of the estimated end-effector (the hand) in the estimated camera frame. We compare this pose to the ground-truth hand-to-camera pose provided by the VICON system. We use the translational and angular error as performance measures.
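The two error measures can be computed from 4x4 homogeneous hand-to-camera poses as follows; this uses the standard definitions, and the exact conventions of the released evaluation code may differ.

```python
import numpy as np

def pose_errors(T_est, T_gt):
    """Translational error (m) and angular error (deg) between two 4x4 poses."""
    t_err = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
    R_rel = T_est[:3, :3].T @ T_gt[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return t_err, np.degrees(np.arccos(cos_angle))
```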

These methods will be compared in the following section:

– Encoders only: a baseline that predicts the hand-to-camera pose by simply applying forward kinematics on the kinematic model, using the raw measured angles.

– Camera offset only: a variant of our method that assumes the joint measurements contain no bias, and only estimates the virtual joints describing the camera offset, i.e. it performs online extrinsic calibration of the camera with respect to the kinematic chain. It uses both joint measurements and depth images.

– Full fusion: our full method, with parameters that allow it to express large biases. It fuses joint measurements and vision at every joint, and also estimates the virtual joints.

– Vision only: a variant of our method that relies only on images, similar to the experiments in [16].

To summarize the performance of each method on the dataset, we use box plots with one bar per method and sequence. Each bar in the plot represents statistics of the translational or angular error obtained by the method on the sequence, taken at regular time intervals, aggregated over the length of the sequence, and further aggregated over 10 runs. See for example Fig. 4. The black ticks and limits of the colored bars indicate the 1st, 25th, 50th, 75th and 99th percentiles.

E. Parameters

The parameters of the image observation model are as in [15]; they are not specific to robot tracking. The new parameters in the proposed method are the noise $\sigma_q$ in the encoder model (3), the noise $\sigma_a$ in the angle process model (4), and finally the parameters $c$ and $\sigma_b$ of the bias process (5). All these parameters have a physical meaning and were chosen in an intuitive manner with only minimal tuning.

There is a trade-off between the magnitude of bias the method can absorb, and the accuracy/smoothness of the estimate. Nevertheless, we used the same parameter set for all the quantitative experiments in the following section. In practice, one could achieve some improvement in accuracy by adapting the parameters to the situation, e.g. reducing $\sigma_b$ or $c$ if we know the bias of the encoders to be low.

VI. EVALUATION

A. Quantitative evaluation on real data

In this experiment, we show the benefit of estimating the virtual joints to correct inaccuracies in the head kinematics. We use real annotated data from Apollo.

Because of the properties of this robot described above, it is safe to assume accurate joint measurements, i.e. no bias, and to only estimate the virtual joints. This corresponds to the camera-offset-only method. We compare it to the encoders-only baseline in Fig. 4. Clearly, estimating the camera offset decreases the error significantly. The 75th-percentile error without estimating the camera offset is on the order of a few centimeters. The camera offset correction allows us to reduce


this error to a few millimeters. The 99th percentile is still relatively high after correction. This is because it takes some time in the beginning of a sequence for the camera offset to converge to the correct value. But once this is achieved, we consistently obtain a low error.

The full-fusion method estimates the bias in the joint angles in addition to the camera offset. We see in Fig. 4 that its accuracy is very similar to the camera-offset-only method. This is expected, due to the absence of biases in the joints of the Apollo robot.

B. Quantitative evaluation on simulated biases

This section shows how our system is capable of dealing with strong, time-varying biases on the joint measurements in a range of conditions. We use the same data as for the previous experiment, except that here we corrupted the joint measurements, as described in Section V-B.

Fig. 4. Performance on real data with accurate joint encoders: box plots of translational error (cm) and angular error (deg) per sequence (s-u-arm, s-l-arm, s-arm, s-head, s-both, arm, both, arm++, both++, occlusion, in/out/in, static) for the encoders-only, camera-offset-only and full-fusion methods.

Fig. 5. Performance on data with simulated constant biases: translational and angular error per sequence for the encoders-only, camera-offset-only, full-fusion and vision-only methods.

Fig. 6. Performance on data with simulated time-varying biases: same layout and methods as Fig. 5.

As we can see in Fig. 5, the error in the end-effector position can be as high as 25 cm when applying forward kinematics to the biased measurements. While estimating the camera offset decreases this error by several centimeters, we need the full fusion to bring it down to the order of millimeters. As shown in Fig. 7a, the full-fusion method requires some time to correct the large bias, but once it has converged, it consistently yields precise estimates.

The case of the time-varying bias (Fig. 6) is more challenging, because of the sudden bias changes of up to 10 degrees. In the time-series plots in Fig. 7 we can see how the fusion filter needs some time to adapt its estimate after each change, so its error is higher during this transition.

It is interesting to compare the behavior of the fusion tracker to the purely visual one. The latter does not use joint measurements, so its performance does not suffer from perturbations in them; see e.g. Fig. 7d. However, it easily loses track during occlusions, fast motions, or when the robot arm is out of view. In contrast, the estimate of the fusion tracker stays at least as accurate as the encoders-only method, even when the visual information is not reliable, e.g. during the strong occlusions in Fig. 7e and 7c (see Fig. 1c for an example frame). Similarly, when the arm goes out of view (Fig. 7b), the fusion tracker's estimates are pulled towards the joint measurements, so the error increases. But it can recover when the arm is within view again.

C. Qualitative evaluation

Fig. 1c shows one frame of the Apollo sequence with strong occlusions and a large simulated constant bias. We can see that our estimate remains accurate despite most of the arm being covered by a person. Fig. 1d shows real depth data from the ARM robot. It gives an idea of the amount of error produced by real bias. More examples of qualitative evaluation can be found in the supplementary video (https://youtu.be/YNP9UCx6Wa4), showing robust and accurate tracking during simultaneous arm and head motion, fast arm motion, and when the arm is going in and out of view.

VII. CONCLUSION

We proposed a probabilistic filtering algorithm which fuses joint measurements with depth images to achieve robust and accurate robot arm tracking. The strength of our approach lies in modeling the measurements in an intuitive, generative way that is realistic enough to achieve high precision, while keeping in mind tractability. In our case, this implied introducing and explicitly estimating auxiliary latent variables such as pixel occlusions, encoder biases and camera offset, which enable a good fit of the data. Some principled approximations, as well as building on previous work such as [16], [20], made it possible to derive a computationally efficient algorithm and real-time implementation.

In this paper, we showed that our method performs quantitatively well under the challenging conditions of the dataset we propose, including fast motion, significant and long-term occlusions, time-varying biases, and the robot arm getting in


Fig. 7. Example run on sequences with constant (top row) and time-varying (bottom row) simulated biases. Each panel plots the translational error (cm) and angular error (deg) over time (s) for the vision-only, encoders-only, camera-offset-only and full-fusion methods. Panels: (a) simultaneous head and arm motion, (b) arm goes out of view and back, (c) strong occlusions, (d) simultaneous head and arm motion, (e) strong occlusions.

and out of view. Further, we have already demonstrated how this method can be integrated into an entire manipulation system that simultaneously tracks the robot arm and an object to enable pick-and-place tasks in uncertain and dynamic environments [21].

Our system is already very robust when tracking the arm using only depth images and joint measurements. An interesting direction for future work is to fuse data from haptic sensors, which may further improve performance when simultaneously estimating object and arm pose during manipulation.

REFERENCES

[1] P. Pastor, M. Kalakrishnan, J. Binney, J. Kelly, L. Righetti, G. Sukhatme, and S. Schaal, "Learning task error models for manipulation," in IEEE Intl Conf on Robotics and Automation, May 2013, pp. 2612–2618.

[2] K. Pauwels and D. Kragic, "Integrated on-line robot-camera calibration and object pose estimation," in IEEE Intl Conf on Robotics and Automation, May 2016, pp. 2332–2339.

[3] V. Pradeep, K. Konolige, and E. Berger, "Calibrating a multi-arm multi-sensor robot: A bundle adjustment approach," in Intl Symposium on Experimental Robotics (ISER), New Delhi, India, Dec 2010.

[4] N. Vahrenkamp, C. Boge, K. Welke, T. Asfour, J. Walter, and R. Dillmann, "Visual servoing for dual arm motions on a humanoid robot," in IEEE-RAS Intl Conf on Humanoid Robots, 2009, pp. 208–214.

[5] C. Choi and H. I. Christensen, "Real-time 3D model-based tracking using edge and keypoint features for robotic manipulation," in IEEE Intl Conf on Robotics and Automation, May 2010, pp. 4048–4055.

[6] D. Kragic, A. T. Miller, and P. K. Allen, "Real-time tracking meets online grasp planning," in IEEE Intl Conf on Robotics and Automation, vol. 3, 2001, pp. 2460–2465.

[7] X. Gratal, J. Romero, J. Bohg, and D. Kragic, "Visual servoing on unknown objects," Mechatronics, vol. 22, no. 4, pp. 423–435, 2012.

[8] S. Hinterstoisser, C. Cagniart, S. Ilic, P. F. Sturm, N. Navab, P. Fua, and V. Lepetit, "Gradient response maps for real-time detection of textureless objects," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 5, pp. 876–888, 2012.

[9] M. Klingensmith, T. Galluzzo, C. Dellin, M. Kazemi, J. A. D. Bagnell, and N. Pollard, "Closed-loop servoing using real-time markerless arm tracking," in IEEE Intl Conf on Robotics and Automation (Humanoids Workshop), May 2013.

[10] P. Besl and N. D. McKay, "A method for registration of 3-D shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 2, pp. 239–256, 1992.

[11] K. Pauwels, V. Ivan, E. Ros, and S. Vijayakumar, "Real-time object pose recognition and tracking with an imprecisely calibrated moving RGB-D camera," in IEEE/RSJ Intl Conf on Intelligent Robots and Systems, Chicago, Illinois, 2014.

[12] M. Krainin, P. Henry, X. Ren, and D. Fox, "Manipulator and object tracking for in-hand 3D object modeling," The Intl Journal of Robotics Research, 2011.

[13] P. Hebert, N. Hudson, J. Ma, T. Howard, T. Fuchs, M. Bajracharya, and J. Burdick, "Combined shape, appearance and silhouette for simultaneous manipulator and object tracking," in IEEE Intl Conf on Robotics and Automation, 2012.

[14] T. Schmidt, K. Hertkorn, R. A. Newcombe, Z. Marton, M. Suppa, and D. Fox, "Depth-based tracking with physical constraints for robot manipulation," in IEEE Intl Conf on Robotics and Automation, May 2015, pp. 119–126.

[15] M. Wüthrich, P. Pastor, M. Kalakrishnan, J. Bohg, and S. Schaal, "Probabilistic object tracking using a range camera," in IEEE/RSJ Intl Conf on Intelligent Robots and Systems, 2013.

[16] M. Wüthrich, J. Bohg, D. Kappler, C. Pfreundt, and S. Schaal, "The Coordinate Particle Filter - a novel Particle Filter for high dimensional systems," in IEEE Intl Conf on Robotics and Automation, May 2015.

[17] A. Doucet, N. de Freitas, K. Murphy, and S. Russell, "Rao-Blackwellised particle filtering for dynamic Bayesian networks," in Proc of the 16th Conf on Uncertainty in Artificial Intelligence, 2000, pp. 176–183.

[18] R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems," Transactions of the ASME - Journal of Basic Engineering, no. 82 (Series D), pp. 35–45, 1960.

[19] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2006.

[20] C. Pfreundt, "Probabilistic object tracking on the GPU," Master's thesis, Karlsruhe Institute of Technology, Mar. 2014.

[21] J. Bohg, D. Kappler, F. Meier, N. Ratliff, J. Mainprice, J. Issac, M. Wüthrich, C. Garcia Cifuentes, V. Berenz, and S. Schaal, "Interlocking perception-action loops at multiple time scales - a system proposal for manipulation in uncertain and dynamic environments," in Intl Workshop on Robotics in the 21st Century: Challenges and Promises, Sep 2016.

