

Computer Vision and Image Understanding 129 (2014) 1–14


Augmenting analytic SFM filters with frame-to-frame features

http://dx.doi.org/10.1016/j.cviu.2014.08.003
1077-3142/© 2014 Elsevier Inc. All rights reserved.

This paper has been recommended for acceptance by Sven Dickinson. * Corresponding author.

E-mail addresses: [email protected] (A. Fakih), [email protected] (D. Asmar), [email protected] (J. Zelek).

1 Office: Raymond Ghosn 410, American University of Beirut, Beirut, Lebanon.
2 Office: DWE 2513G, University of Waterloo, Waterloo, Canada. Fax: +1 519 746 4791.

Adel Fakih a,*, Daniel Asmar b,1, John Zelek a,2

a Department of Systems Design Engineering, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada
b American University of Beirut, P.O. Box 110236 (Mechanical Engineering), Riad El-Solh, Beirut 1107 2020, Lebanon

Article info

Article history:
Received 25 May 2013
Accepted 28 August 2014
Available online 16 September 2014

Keywords:
Structure from motion
Filtering
Complexity reduction
Frame-to-frame features

Abstract

In Structure From Motion (SFM), image features are matched either in an extended number of frames or only in pairs of consecutive frames. Traditionally, SFM filters have been applied using only one of the two matching paradigms, with the Long Range (LR) feature technique being more popular because features that are matched across multiple frames provide stronger constraints on structure and motion. Nevertheless, Frame-to-Frame (F2F) features possess the desirable property of being abundant because of the large similarity that exists between closely spaced frames. Although the use of such features has been limited mostly to the determination of inter-frame camera motion, we argue that significant improvements can be attained in online filter-based SFM by integrating the F2F features into filters that use LR features. The main contributions of this paper are twofold. First, it presents a new method that enables the incorporation of F2F information in any analytical filter in a fashion that requires minimal change to the existing filter. Our results show that by doing so, large increases in accuracy are achieved in both the structure and motion estimates. Second, thanks to mathematical simplifications we realize in the filter, we reduce the computational burden of F2F integration by two orders of magnitude, thereby enabling its real-time implementation. Experimental results on real and simulated data prove the success of the proposed approach.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Two categories of image features can be used in recovering the 3D motion and/or scene structure from video images (Fig. 1):

1. Long Range (LR) features, which are tracked over an extended number of frames: this type of feature introduces 3D-to-2D constraints linking the scene structure and the 3D motion to the 2D projection of the features in the images. These constraints allow the recovery of both the 3D motion and the scene structure, and most approaches use this category of features [4,20,21,14,3,24].

2. Frame-to-Frame (F2F) features, which are matched only between pairs of consecutive frames: this type of feature is generally not robust enough to be used for estimating the structure of the scene because those features provide, for a given 3D point, only two image projections in two spatially close frames. It is for this reason that such features have traditionally been used mostly to pose constraints on the motion between the two corresponding frames. Such constraints constitute a measurement of the differential motion (velocity or incremental change of motion), in contrast to LR features, which provide an "absolute" measurement of the motion and structure. The reliability of this type of differential measurement stems from the large number of F2F features that can be matched between consecutive frames.

F2F features have been used in some analytic recursive filters for the purpose of motion estimation, such as the essential filter of Soatto et al. [27], which uses the epipolar constraint as a measurement equation. Soatto and Perona also introduced the subspace filter [26], which uses the subspace method of Jepson and Heeger [10] based on optical flow as a measurement mechanism. However, such approaches suffer from two major limitations. First, the translation magnitude between different frames cannot be estimated relative to a common gauge, and hence the obtained estimates cannot be integrated together in order to determine the absolute motion. Second, only the motion can be estimated reliably. Furthermore, such filters have cubic computational complexity in the number of F2F features and hence are not able to run in real time with a large number of features. We postulate here that F2F information is the most beneficial if used to augment multiple-frame tracking systems, which should result in better estimates for both structure and motion. In fact, F2F features have been used in this fashion in the context of offline Bundle Adjustment (BA) by Zhang and Shan [33] and also in Particle Filters by Eade and Drummond [3], but only for the sake of weighting the particles.

Fig. 1. Two types of image measurements for SFM estimation: LR features tracked over many frames (solid lines) and F2F features matched only in couples of consecutive frames (dashed arrows).

In the context of analytic SFM filters (i.e., filters using an analytic representation of the distribution of the state vector as opposed to particles), we are not aware of any approach that augments filters using LR features with F2F features. This is mainly due to two obstacles. First, adding F2F measurements to an SFM system results in one additional measurement equation for each F2F feature. This is somewhat problematic as the additional equation is an implicit one, meaning that it is defined as a non-linear expression containing both measured and unknown variables instead of the standard explicit form, where the measurement variables are expressed in terms of the unknown variables. As a result, the solution of SFM would require a treatment that is dependent on the type of filtering that is adopted. For instance, with the Extended Kalman Filter (EKF) a linearization solution (such as a first-order Taylor series expansion about the measurements) is required, while the Unscented Kalman Filter (UKF) [8] and Particle Filter (PF) [20] can handle the implicit measurements directly. Second, the real benefit of F2F features lies in involving as many of those features as possible; however, as the computational complexity is cubic in the number of F2F features, incorporating a large number of those features in an online filter is computationally prohibitive.

This paper presents a solution to the two aforementioned obstacles. To overcome the problem of the dependency on the type of filter, we propose to incorporate the F2F features in an extra filtering iteration that is performed immediately after every iteration of the main filter. In theory, given two independent sets of observations, performing filtering using both sets simultaneously is exactly equivalent to filtering using one of the sets and then filtering using the other. The numerous benefits of having a separate filtering stage include:

• The separate filtering stage can easily augment any estimator as long as it maintains a mean vector and a covariance matrix. It can be added to existing implementations with minimal changes to existing code. In fact, it can be coded as a generic function taking as arguments the state vector, its covariance and the F2F features. Then, augmenting any filter with F2F features would amount only to calling this function with the appropriate arguments after every iteration of the main filter. The experimental results section shows how the proposed approach can be used to straightforwardly augment Davison et al.'s EKF SLAM system [2].

• The results of the filtering can be accepted or rejected based on some criteria such as the epipolar error of the F2F features, the number of outliers or the extent of change in the state vector. This helps avoid performing the update in the case where, for one reason or another, the F2F features are contaminated with a large number of outliers.

• The separate filtering stage can be divided into several independent steps, which allows the use of robust estimation techniques such as RANdom SAmple Consensus (RANSAC) [5].

• As one of the problems of F2F features is a significant number of outliers in some situations, carrying out the F2F filtering as an extra step provides an opportunity to use the motion estimates output by the LR filter to prune the F2F outliers before performing the F2F filtering.

• Most importantly, the measurements in the separate filtering stage (F2F features) provide only a partial observation of the state vector. In this paper we show that this fact can be exploited in order to reduce the cost of the F2F filtering by two orders of magnitude.

The cost of the extra filtering is initially cubic in the number of added F2F features and in the size of the filtered state vector. Two aspects are exploited to reduce this cost. First, the noise vectors in different F2F features are assumed to be statistically independent, and hence their covariance matrix is block-diagonal (this assumption is made by most SFM techniques and is considered to be realistic enough even if it is not a very accurate model of the real noise). Second, the F2F features provide a partial observation of the state vector and hence directly affect only a small part of this vector, namely the camera velocity and its global rotation. Their effect on the other components of the state vector comes about only through the covariance matrix of the state vector. To capitalize on this fact, the proposed system first updates only the velocity and global rotation estimates using F2F features, then spreads the update to the other components of the state vector using the covariance matrix of the state vector. By using the Sherman–Morrison–Woodbury formula for inverting sums of matrices [7], the update of the velocity can be performed with linear cost in the number of F2F features. The propagation of the update to the other components of the state vector is independent of the number of F2F features.

A. Fakih et al. / Computer Vision and Image Understanding 129 (2014) 1–14 3

It is important to mention here the relation to another category of SFM approaches based on parallel tracking and mapping [12,13,17,30]. Those approaches use two processing phases, often performed in two concurrent threads: a front-end used to track the camera between key-frames, and a back-end used as a global optimizer, mostly via some kind of bundle adjustment over a fixed number of frames. Those approaches have been shown [29] to outperform filter-based approaches, especially when the camera stays most of the time in the same environment and when computational resources are not scarce. However, in many scenarios, filter-based approaches are the better choice. Examples of such scenarios are tracking applications where computational resources are scarce, such as on embedded systems, and applications that integrate with other sensors or with already existing systems. Furthermore, filter-based approaches can be used as the front-end tracking system in parallel tracking and mapping approaches.

Another related approach that uses both short and long baselines, and exploits the fact that matching across frames taken from very close viewpoints leads to a large number of good matches, is the Dense Tracking and Mapping (DTAM) approach [18]. This approach uses a number of frames in the vicinity of a key-frame to build a dense map. All the pixels in those frames are matched, and reliability is achieved by the large number of frames considered within the short baseline and via spatial regularization performed with the aid of a GPU. The tracking between key-frames is performed by determining the motion that minimizes the difference between the projected map and the next key-frame.

The remainder of this paper is structured as follows. Section 2 introduces the problem, notations and formulations. Section 3 presents a high-level description of the proposed solution. Section 4 addresses the problem of complexity reduction. Finally, Section 5 highlights the improvements obtained by applying the proposed approach via corresponding experimental results.

Fig. 2. The stars represent LR features that are observed from every camera position. The circles represent F2F features that are observed only from adjacent positions.

2. Problem formulation

Consider a 3D camera moving in a scene and making repetitive observations of a set of 3D points (features). As shown in Fig. 2, some of the features, P_i, i = 1, ..., N, are assumed to be observed by the camera at every time step (LR features), while others, Q_j, j = 1, ..., K, are only observable from two adjacent positions.

The following notational conventions are assumed in formulating this estimation problem. The transpose of a column vector V is written as V' and its corresponding normalized version as V̂. The covariance matrix of a vector V is denoted Σ(V). The Jacobian of a function h(V) with respect to the vector V is represented as J(h, V). The notation I(N) is used to denote an N-dimensional identity matrix.

Two coordinate frames of reference are used: a fixed world coordinate frame and a camera-tied frame. Vectors represented in the world coordinate frame are superscripted by w (for example T^w). Similarly, vectors represented in the camera frame are superscripted with c (for example T^c). The camera pose in the world frame consists of a rotation represented by a quaternion q^w (unit 4-vector) and a translation vector T^w. The rotation matrix corresponding to the quaternion q is denoted R(q) and is given by:

R(q) =
  [ 1 - 2q_2^2 - 2q_3^2    2q_1 q_2 - 2q_0 q_3    2q_1 q_3 + 2q_2 q_0 ]
  [ 2q_1 q_2 + 2q_0 q_3    1 - 2q_1^2 - 2q_3^2    2q_2 q_3 - 2q_0 q_1 ]
  [ 2q_1 q_3 - 2q_2 q_0    2q_2 q_3 + 2q_0 q_1    1 - 2q_1^2 - 2q_2^2 ].   (1)

The velocity or instantaneous motion of the camera consists of an angular velocity vector ω^w (angle-axis representation) and a translational velocity vector V^w (Fig. 3). The rotation matrix corresponding to the angular velocity vector ω is denoted R(ω) and is given by Rodrigues' formula:

R(ω) = I + sin(‖ω‖) [ω̂]_× + (1 - cos(‖ω‖)) (ω̂ ω̂' - I),   (2)

where [ω̂]_× is the skew-symmetric matrix corresponding to ω̂. The relation between the camera velocity (V^w, ω^w) in the world frame and in the camera frame is expressed as follows:

V^c = R(q^w)' V^w, and
ω^c = R(q^w)' ω^w,   (3)

where the prime on the rotation matrix stands for its transpose. A 3D point P^w in the world frame is transformed into the camera frame by the equation:

P^c = R(q^w)' (P^w - T^w).   (4)
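For illustration, Eqs. (1), (2) and (4) transcribe directly into code; the following is a minimal C++ sketch with Eigen (function names are ours, not from the original implementation), assuming q = (q_0, q_1, q_2, q_3) with the scalar part first:

#include <Eigen/Dense>
#include <cmath>

// Eq. (1): rotation matrix from a unit quaternion q = (q0, q1, q2, q3),
// with q0 the scalar part.
Eigen::Matrix3d rotationFromQuaternion(const Eigen::Vector4d& q) {
  const double q0 = q(0), q1 = q(1), q2 = q(2), q3 = q(3);
  Eigen::Matrix3d R;
  R << 1 - 2*q2*q2 - 2*q3*q3, 2*q1*q2 - 2*q0*q3,     2*q1*q3 + 2*q2*q0,
       2*q1*q2 + 2*q0*q3,     1 - 2*q1*q1 - 2*q3*q3, 2*q2*q3 - 2*q0*q1,
       2*q1*q3 - 2*q2*q0,     2*q2*q3 + 2*q0*q1,     1 - 2*q1*q1 - 2*q2*q2;
  return R;
}

// Eq. (2): Rodrigues' formula for an angle-axis vector w.
Eigen::Matrix3d rotationFromAngleAxis(const Eigen::Vector3d& w) {
  const double theta = w.norm();
  if (theta < 1e-12) return Eigen::Matrix3d::Identity();
  const Eigen::Vector3d a = w / theta;  // unit rotation axis
  Eigen::Matrix3d K;                    // skew-symmetric matrix [a]x
  K <<    0.0, -a.z(),  a.y(),
        a.z(),    0.0, -a.x(),
       -a.y(),  a.x(),    0.0;
  return Eigen::Matrix3d::Identity() + std::sin(theta) * K
       + (1.0 - std::cos(theta)) * (a * a.transpose() - Eigen::Matrix3d::Identity());
}

// Eq. (4): world-to-camera transform of a 3D point.
Eigen::Vector3d worldToCamera(const Eigen::Vector4d& q, const Eigen::Vector3d& T,
                              const Eigen::Vector3d& Pw) {
  return rotationFromQuaternion(q).transpose() * (Pw - T);
}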

A point observed by the camera at position Q^c_t at time t will be observed at position Q^c_{t+1} at time t + 1:

Q^c_{t+1} = R(ω^c_t)' (Q^c_t - V^c_t) = R(R(q^w_t)' ω^w_t)' (Q^c_t - R(q^w_t)' V^w_t).   (5)

Fig. 3. The two coordinate frames used (world axes x^w, y^w; camera axes x^c, y^c; pose T^w, q^w; velocity V, ω).


Note that the last equation involves not only the velocities (V^w, ω^w) but likewise the global rotation q^w. Assuming a calibrated camera and a pin-hole perspective projection model, the camera observes the projections on the image plane, p^c and q^c, of the LR feature P^c and the F2F feature Q^c respectively. Those projections are given by:

p^c = [ P^c_0 / P^c_2,  P^c_1 / P^c_2,  1 ]',   (6)

and

q^c = [ Q^c_0 / Q^c_2,  Q^c_1 / Q^c_2,  1 ]'.   (7)
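In code, this projection is a one-liner; the following sketch reuses the Eigen types of the previous block:

// Eqs. (6)-(7): normalized pinhole projection of a camera-frame point.
Eigen::Vector3d project(const Eigen::Vector3d& Pc) {
  return Eigen::Vector3d(Pc.x() / Pc.z(), Pc.y() / Pc.z(), 1.0);
}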

The goal of the SFM problem is to estimate the motion of the sensor, T^w, q^w, V^w and ω^w, in addition to the 3D positions of the LR features P^w_i, from the projections p^c_i and q^c_j.

2.1. Measurement model

The measurement model describes how the observed values (p^c_i and q^c_j) are related to the 3D unknown parameters.

For the LR features, substituting P^c_i in Eq. (6) by its value from Eq. (4) gives directly an expression of p^c_i in terms of the 3D parameters.

For the F2F features Q^c_t and Q^c_{t+1} at times t and t + 1, only their projections are observed and matched. Therefore, the unknown depth of these features has to be eliminated in order to obtain a constraint that links the incremental motion between t and t + 1 to the F2F correspondences. The choice of the approach taken to get rid of the depth is crucial for the correctness of the estimation. For example, using an error measure based on the simple epipolar constraint (q^c_{t+1})' E q^c_t = 0, where E is the essential matrix, introduces a bias in the estimates because the error (q^c_{t+1})' E q^c_t is an algebraic error whose value depends on the actual values of q^c_{t+1} and q^c_t, in contrast to the true geometrical error, which depends only on how far away from their true values q^c_{t+1} and q^c_t are. To alleviate that, many approaches have attempted to provide an approximation of the true geometrical error. For instance, Torr and Murray [32] used Sampson's approximation to obtain a first-order approximation of the geometric epipolar error. Kanatani and Sugaya [11] pointed out that Sampson's approximation is so good that the difference is only in insignificant digits. Oliensis [19] derived an optimal expression of the exact geometrical error that is exactly equivalent to the full error equation without depth elimination. Using V, R, q_0 and q_1 to refer to V^c(t), R(ω^c(t)), q^c_t and q^c_{t+1} respectively, the Oliensis optimal error can be expressed as:

e = a/2 - sqrt(a^2/4 - b),
a = q̂_0' (I(3) - V̂ V̂') q̂_0 + q̂_1' R (I(3) - V̂ V̂') R' q̂_1, and
b = (V̂' (q̂_0 × R' q̂_1))^2.   (8)

Note that b in the above expression is the same as the regular epipolar error. Combining this error function with (3) leads to an expression of the error in terms of the camera motion in world coordinates of the form

e(q^w, V^w, ω^w, q^c_t, q^c_{t+1}) = 0.   (9)

Stacking all the F2F features at time t in the vector Z^{f2f}_t results in a multi-dimensional measurement equation of the form:

h(q^w, V^w, ω^w, Z^{f2f} - n^{f2f}) = h(S, Z^{f2f} - n^{f2f}) = 0,   (10)

where S is the state vector of parameters to be estimated:

S = [T^w, q^w, V^w, ω^w, P^w_1, ..., P^w_N],   (11)

and Z_t is the vector of observations at time t:

Z = [p^c_{1,t}, ..., p^c_{N,t}, q^c_{1,t-1}, q^c_{1,t}, ..., q^c_{K,t-1}, q^c_{K,t}].   (12)

n^{f2f} is a Gaussian vector representing the noise in the observation vector Z^{f2f}.
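For illustration, the Oliensis error of Eq. (8) can be evaluated per F2F correspondence as in the following sketch (ours, assuming homogeneous image points and the Eigen types of the earlier sketches):

// Eq. (8): Oliensis' optimal two-frame error for one F2F correspondence.
// q0, q1 are the homogeneous image points in frames t and t+1; V and R are
// the camera-frame translation and inter-frame rotation.
double oliensisError(const Eigen::Vector3d& q0_in, const Eigen::Vector3d& q1_in,
                     const Eigen::Vector3d& V, const Eigen::Matrix3d& R) {
  const Eigen::Vector3d q0 = q0_in.normalized();
  const Eigen::Vector3d q1 = q1_in.normalized();
  const Eigen::Vector3d Vn = V.normalized();
  const Eigen::Matrix3d P =
      Eigen::Matrix3d::Identity() - Vn * Vn.transpose();   // I(3) - V V'
  const double a = q0.dot(P * q0) + q1.dot(R * P * R.transpose() * q1);
  const double b = std::pow(Vn.dot(q0.cross(R.transpose() * q1)), 2.0);
  const double disc = a * a / 4.0 - b;   // guard tiny negative round-off
  return a / 2.0 - std::sqrt(disc > 0.0 ? disc : 0.0);
}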

2.2. Dynamic system estimation

The estimation of structure and motion can be cast within a dynamic system framework formulated as follows:

S_{t+1} = f(S_t) + n^s_t,
h(S_t, Z_t - n^z_t) = 0.   (13)

The function f expresses the time evolution of the state vector:

T^w_{t+1} = R(ω^c_t) T^w_t + V^c_t,
q^w_{t+1} = q(R(ω^c_t) R(q^w_t)),
V^w_{t+1} = R(ω^c_t) V^w_t,
ω^w_{t+1} = R(ω^c_t) ω^w_t, and
P^w_{i,t+1} = P^w_{i,t}.   (14)
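A direct transcription of this prediction step, reusing the helpers above, could look as follows (a sketch; quaternionFromRotation is a hypothetical helper that extracts the unit quaternion of a rotation matrix):

#include <vector>

Eigen::Vector4d quaternionFromRotation(const Eigen::Matrix3d& R);  // hypothetical

// State vector of Eq. (11) and the prediction step of Eq. (14).
struct State {
  Eigen::Vector3d T;               // T^w
  Eigen::Vector4d q;               // q^w
  Eigen::Vector3d V;               // V^w
  Eigen::Vector3d w;               // omega^w
  std::vector<Eigen::Vector3d> P;  // LR feature positions P^w_i
};

State predict(const State& s) {
  const Eigen::Matrix3d Rq = rotationFromQuaternion(s.q);
  const Eigen::Vector3d Vc = Rq.transpose() * s.V;  // Eq. (3)
  const Eigen::Vector3d wc = Rq.transpose() * s.w;
  const Eigen::Matrix3d Rw = rotationFromAngleAxis(wc);
  State n = s;                                      // features are static
  n.T = Rw * s.T + Vc;
  n.q = quaternionFromRotation(Rw * Rq);            // q(R(w^c) R(q^w))
  n.V = Rw * s.V;
  n.w = Rw * s.w;
  return n;
}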

The function h represents the relation between the state vector S and the measurement vector Z, and can be derived by stacking the measurement equations of the LR and F2F features described in Section 2.1. Note that the measurement equation is written as h(S_t, Z_t - n^z_t) and not in the more common form Z_t = h(S_t) + n^z_t. This is because Eq. (9) has this form, which is often referred to as the implicit form [25] and has important implications on the filtering, as will be discussed below. Finally, n^s is a random zero-mean Gaussian vector that represents the uncertainty in the evolution of the system, and n^z is a zero-mean Gaussian vector that represents the noise in the sensor observation.

The dynamic system in Eq. (13) can be solved recursively, starting from an initial state at time 0 and then using some recursive filter in order to obtain S_{t+1} and Σ(S_{t+1}) from S_t and Σ(S_t). The de facto standard Extended Kalman Filter (EKF) cannot be used directly in such a case because the measurement equation is implicit, as mentioned earlier. Therefore, a modification of this filter has to be performed to account for this fact. The regular EKF equations to update S are as follows:

K = J(h, S) Σ(S) J(h, S)' + Σ(n^z),
L = Σ(S) J(h, S)' K^{-1},
C = I - L J(h, S),
S^+ = S + L (Z - h(S)),
Σ(S^+) = C Σ(S),   (15)

where (Z - h(S)) is called the innovation (or residual), K is called the innovation covariance and L is the gain. When the measurement equation is implicit, as in (13), the above equations can no longer be used directly. Implicit equations arise in many sensing applications. Soatto et al. [25] derived a solution for Extended Kalman Filtering with implicit measurements based on a first-order Taylor expansion of the implicit equation. Steffen [28] used a similar expansion to derive an iterative filter which is essentially equivalent to linear least squares. Adopting a similar first-order linearization technique, the following changes are to be made in order to maintain the correctness of the filter:

1. The innovation is expressed as h(S, Z) instead of (Z - h(S)).
2. The effect of Σ(n^z) on the innovation covariance is no longer additive but acts via the non-linear equation h. Therefore its contribution to the innovation covariance can be approximated as J(h, Z) Σ(n^z) J(h, Z)'.

3. The gain L expression needs to be multiplied by -1 because Z and S are on the same side of the equation.
4. In fact, the covariance update equation in (15) is a reduced form of the following equation:

Σ(S^+) = C Σ(S) C' + L Σ(n^z) L'.   (16)

In the implicit case, the reduction can no longer be done, as in this case Σ(n^z) has to be replaced by J(h, Z) Σ(n^z) J(h, Z)'.

The modified equations can be applied to the dynamic system in Eq. (13) to provide an estimate of S_t at each time step t.
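As an illustration of these four modifications, the following C++/Eigen sketch (ours, with generic matrix arguments) performs one implicit-EKF update, assuming h(S, Z) and its Jacobians have been evaluated at the current estimate:

// One implicit-EKF update implementing the four modifications above.
void implicitEkfUpdate(Eigen::VectorXd& S, Eigen::MatrixXd& CovS,
                       const Eigen::VectorXd& hSZ,     // innovation h(S, Z)
                       const Eigen::MatrixXd& Jh_S,    // dh/dS, K x N
                       const Eigen::MatrixXd& Jh_Z,    // dh/dZ, K x M
                       const Eigen::MatrixXd& CovNz) { // noise cov, M x M
  // Modification 2: the noise enters through J(h, Z), not additively.
  const Eigen::MatrixXd Rh = Jh_Z * CovNz * Jh_Z.transpose();
  const Eigen::MatrixXd K  = Jh_S * CovS * Jh_S.transpose() + Rh;
  // Modification 3: the gain carries a minus sign.
  const Eigen::MatrixXd L  = -CovS * Jh_S.transpose() * K.inverse();
  const Eigen::MatrixXd C  =
      Eigen::MatrixXd::Identity(S.size(), S.size()) - L * Jh_S;
  S += L * hSZ;                                   // Modification 1: h(S, Z)
  CovS = C * CovS * C.transpose() + L * Rh * L.transpose();  // Modification 4
}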

3. F2F filtering methodology

Applying the implicit EKF (IEKF) filtering as presented in Section 2 induces two problems: first, because (10) is implicit, the procedure of adding F2F features is different for different filtering techniques. Second, the computational cost is too high for real-time processing. This paper proposes to solve these two problems by separating the filtering with the F2F features from the main filter using the LR features (Fig. 4). Such a scheme allows the use of IEKF filtering for the F2F features regardless of the type of filtering used for the LR ones. Furthermore, it provides more potential for the reduction of the computational cost, as will be shown later. In the remainder of the paper, the notations S_1 and S_2 will be used to represent the following sub-vectors of S:

S = [S_1, S_2],
S_1 = [T^w, P^w_1, ..., P^w_N], and
S_2 = [q^w, ω^w, V^w].   (17)

The dimension of S_1 is 3N + 3 (3 parameters for the translation and 3N for the structure). The dimension of S_2 is 10. The covariance matrix Σ(S) then consists of the following sub-matrices:

Σ(S) = [ Σ(S_1)       Σ(S_1, S_2) ]
       [ Σ(S_2, S_1)  Σ(S_2)      ].   (18)

Fig. 4. Augmenting SFM filters with an independent filtering stage using F2F features: the extra step does not interfere with the internal operation of the main filter. It only takes the output of the main filter between two iterations and modifies it to account for the F2F feature information.

Separating the LR filtering from the F2F filtering (Fig. 4) relieves us from any concerns regarding the nature or mode of operation of the LR filter. The only assumption made about the LR filter is that it should output, at every time step, a mean vector and a covariance matrix for the state vector. No restriction is even placed on the order of the parameters in the state vector or on the representation of the rotations in it. The first step of the proposed F2F filter re-orders the 3D parameters and their covariance matrix to ensure they are in the form [S_1, S_2] described above.

3.1. Dealing with F2F outliers

The outliers in the LR filtering are assumed to be the responsibility of the LR filter. For the F2F features, the outliers are detected using the estimates of V^c and ω^c and their covariance matrix, which can be derived from the output of the LR filter using Eq. (3). For each feature, the error expression (8) and its variance are evaluated using the mentioned estimates of V^c and ω^c and their covariance matrix. If the error is greater than 1.5 times the variance, the feature is considered to be an outlier and is not used in the filtering.
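A sketch of this gate, reusing the oliensisError function above (the variance argument stands for the first-order propagated variance of the error, whose computation is omitted here):

// Keep a feature only if its error falls within 1.5 times its variance.
bool isF2FInlier(const Eigen::Vector3d& q0, const Eigen::Vector3d& q1,
                 const Eigen::Vector3d& Vc, const Eigen::Matrix3d& Rwc,
                 double variance) {
  return oliensisError(q0, q1, Vc, Rwc) <= 1.5 * variance;
}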

3.2. Filtering equations

While S_t and Σ(S_t) represent the output of the main filter at time t, the output of the extra F2F filtering step will be superscripted with a +. The implicit filtering is applied only to the F2F measurement equation (Eq. (10)) and the update equations can be written as follows:

K = J(h, S) Σ(S) J(h, S)' + J(h, Z^{f2f}) Σ(n^{f2f}) J(h, Z^{f2f})',
L = -Σ(S) J(h, S)' K^{-1},
C = I(N) - L J(h, S),
S^+ = S + L h(S, Z), and
Σ(S^+) = C Σ(S) C' + L [J(h, Z^{f2f}) Σ(n^{f2f}) J(h, Z^{f2f})'] L'.   (19)

In the remainder of this paper, the matrix J(h, Z^{f2f}) Σ(n^{f2f}) J(h, Z^{f2f})', which represents the uncertainty in h due to the uncertainty in the F2F features, will be referred to as Σ_h(n^{f2f}), abbreviated Σ_h in the longer equations below.

The expressions of J(h, S) and J(h, Z^{f2f}) are:

J(h, S) = ∂h(S, Z^{f2f}) / ∂S,
J(h, Z^{f2f}) = ∂h(S, Z^{f2f}) / ∂Z^{f2f}.   (20)

The matrix J(h, S) has a dimensionality of K × (3N + 13), where K is the number of F2F features. It consists of two parts: the first one, of size K × (3N + 3), corresponds to the Jacobian of h with respect to S_1 and is all zeros; the second part is the Jacobian matrix of h with respect to S_2, which is a K × 10 matrix. Therefore, only this non-zero part needs to be computed and stored, and its derivation is done using the chain rule:

J(h, S_2) = ∂h/∂S_2 = (∂h / ∂[V^c, ω^c]) (∂[V^c, ω^c] / ∂[q^w, V^w, ω^w]),   (21)

where h is the vector of errors defined in (10). The individual Jacobians in the above formula are computed using the Maple software. The matrix J(h, Z^{f2f}) is block-diagonal, where every block is a 1 × 4 row vector containing the 4 elements of ∂h/∂[x(t), x(t+1)]. The matrix Σ_h(n^{f2f}) = J(h, Z^{f2f}) Σ(n^{f2f}) J(h, Z^{f2f})' is a K × K diagonal matrix representing the uncertainty in h due to the uncertainty in Z^{f2f}. Its computation can be done in O(K) time since both J(h, Z^{f2f}) and Σ(n^{f2f}) are block-diagonal matrices, and only corresponding blocks are multiplied together.

Table 1
Number of FLOPS involved in the direct evaluation of the update operations.

Operation   Cost (FLOPS)
K           2(KN^2 + K^2 N) - KN + 9K
K^{-1}      3K^3 + 12.5K^2 + 0.5K - 8
L           2NK^2 - NK
C           2N^2 K - N^2 + N
Σ(S^+)      4N^3 - 2N^2 + 2N^2 K + NK
S^+(t)      2NK
Total       c_0(N, K) = 3K^3 + 4N^3 + 6KN^2 + 4K^2 N + KN + 12.5K^2 + 9.5K - 3N^2 + N - 8


4. Complexity reduction

The computational cost will be reported as the number of FLoating-Point Operations (FLOPS). It is assumed that the computations are performed on the Pentium 4 or Pentium M generation of processors, on which a division is carried out in 8 FLOPS and the log operation in 20 FLOPS. Also, it is assumed that the LU method is used for matrix inversions. Under these assumptions, the inversion of a K × K matrix requires 3K^3 + 12.5K^2 + 0.5K - 8 FLOPS. Also in this section, the notation N is used to represent the total number of parameters in the state vector (i.e., for N_t LR features, the total number of parameters in the state vector is N = 3N_t + 13). The reported processing times are achieved on a Linux machine with a Pentium M 1.6 GHz processor and 3 GB of RAM.

The cost of directly evaluating (19), denoted c_0(N, K), will be used as a baseline for comparison. The breakup of the total number of FLOPS involved in evaluating (19) is shown in Table 1. Note that L is computed using the same Σ(S(t)) J(h, S(t))' that is determined when evaluating K. The total cost c_0(N, K) is cubic not only in the number of added F2F features (K) but also in the number of parameters in the state vector N, and its computational complexity is O(K^3 + N^3 + KN^2 + K^2 N). Two aspects of the system (19) can be exploited to reduce c_0(N, K):

1. The noise vectors in the F2F features are statistically independent, which means that their covariance matrix is diagonal. This allows the use of matrix inversion identities to manipulate the update equations in such a way that most of the costly operations, such as matrix inversions and multiplications, involve mostly diagonal matrices and matrices with lower dimensions (N instead of K). This leads to significant computational savings, especially in the case where the state vector dimension is much smaller than the number of F2F features, such as when performing the update of the velocity vectors only.

2. The partial observability of the state vector given the F2F features: since the F2F features measurement equation involves only the part S_2 of S, the Jacobian of h with respect to S is a K × N matrix with only the last 10 columns non-zero. This aspect can be capitalized on in one of two ways. (1) Algebraically, which involves analytically evaluating all the implications of the zero blocks in the computations and skipping the explicit evaluation of the pertinent operations. It also involves rearranging the expressions in such a way that the effect of skipping the zero-blocks operations is maximized. This procedure will be referred to as Zero-blocks Skipping. (2) Statistically, by updating at first only the translational and rotational velocities to incorporate the F2F information; the rest of the state vector is then updated based on its correlation with the velocities.

The two considerations above can therefore be exploited in three possible ways:

1. Algebraically manipulate the filtering equations to capitalize on the statistical independence of the features (referred to subsequently as Procedure I), followed by the zero-blocks skipping procedure (Section 4.1).
2. Algebraically manipulate the equations to capitalize on the statistical independence of the features while simultaneously taking into consideration the zero-blocks (referred to subsequently as Procedure II) (Section 4.2).
3. Update, in a first step, only the 10 observable elements (S_2) of the state vector. This partial filtering can be performed very efficiently and in time linear in K using Procedure I with the special case of N = 10. Afterwards, propagate the update to the other non-observable elements (S_1) of the state vector (Section 4.3).

Table 2 recapitulates those three possibilities along with the computational complexity of each one. The approach based on partial filtering followed by update propagation possesses the lowest computational complexity and is thus the one adopted in this paper. The remainder of this section starts with a description of Procedure I, since it is used in the adopted approach. Then the computational performance of Procedure II, whose description is provided in Appendix A for the interested reader, is discussed. Finally, the partial filtering/update propagation approach is presented.

Table 2
Three different possibilities to reduce the computational complexity of the F2F filtering.

    Features independence strategy     Partial observation strategy            Complexity
1   Procedure I (with N = N_t + 13)    Zero-blocks skipping                    O(N^3 + KN^2)
2   Procedure II                       -                                       O(KN^2)
3   Procedure I (with N = 10)^a        Partial filtering/update propagation    O(N^2 + K)

^a There is an abuse of notation here, as N represents the size of the state vector in the partial filtering, i.e., 10 and not N_t + 13. Nevertheless, in the corresponding complexity, N is equal to N_t + 13.

4.1. Procedure I

The method presented here is general enough to be applied to any Kalman filtering system where the dimension of the state vector is smaller than the dimension of the measurement vector, and where the noise vectors in the elements of the measurement vector are statistically independent. As the K F2F features are independent observations, updating the state vector using the K features at once is equivalent to performing K consecutive updates using one feature at a time. Hence, the update procedure can be modified to be linear in K. However, resorting to a series of single-feature updates is not a good strategy for that purpose, as the cost would still be computationally intensive since it would involve an O(KN^3) term. Nevertheless, as shown in the following, through using the inversion-of-sums-of-matrices identities and capitalizing on the fact that Σ_h is diagonal, the filtering can be performed with a computational cost that is linear in K and that involves an 11N^3 term instead of KN^3. Central to this reduction procedure is the Sherman–Morrison–Woodbury identity [7]:



(A + UBV)^{-1} = A^{-1} - A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1},   (22)

where A, U, B and V all denote matrices of compatible sizes. Using this identity, the expression of the matrix K^{-1} can be modified as follows:

K^{-1} = (Σ_h + J(h, S) Σ(S) J(h, S)')^{-1}
       = Σ_h^{-1} - Σ_h^{-1} J(h, S) [I(N) + Σ(S) J(h, S)' Σ_h^{-1} J(h, S)]^{-1} Σ(S) J(h, S)' Σ_h^{-1}.   (23)

Note that this inversion is useful only because the measurements are independent, and hence Σ_h is diagonal and can be inverted in only K divisions. L can hence be re-written as:

L = -Σ(S) J(h, S)' [Σ_h^{-1} - Σ_h^{-1} J(h, S) (I(N) + Σ(S) J(h, S)' Σ_h^{-1} J(h, S))^{-1} Σ(S) J(h, S)' Σ_h^{-1}]
  = -Σ(S) J(h, S)' Σ_h^{-1} + Σ(S) J(h, S)' Σ_h^{-1} J(h, S) (I(N) + Σ(S) J(h, S)' Σ_h^{-1} J(h, S))^{-1} Σ(S) J(h, S)' Σ_h^{-1}.   (24)

Let G_0 and G_1 be the two matrices:

G_0 = Σ(S) J(h, S)' Σ_h^{-1}, and
G_1 = Σ(S) J(h, S)' Σ_h^{-1} J(h, S) = G_0 J(h, S).   (25)

Σ(S) J(h, S)' is a multiplication of an N × N matrix by an N × K matrix and can be done in 2N^2 K - NK FLOPS. Multiplying the resulting matrix by Σ_h^{-1} requires only NK FLOPS, as Σ_h^{-1} is diagonal. The inversion of Σ_h requires 8K FLOPS. Therefore, G_0 can be evaluated in 2N^2 K + 8K FLOPS. G_1 can then be carried out in 2N^2 K - N^2 FLOPS.

With the above notations, L and L J(h, S) can be re-written as:

L = (G_1 (I(N) + G_1)^{-1} - I(N)) G_0, and
L J(h, S) = (G_1 (I(N) + G_1)^{-1} - I(N)) G_1.   (26)

The reason for writing L J(h, S) this way is that it can be computed much faster than by multiplying L by J(h, S). The numbers of FLOPS of each of the operations involved in the update, before and after considering the zero-blocks in the matrices, are shown in Table 3.
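A minimal Eigen sketch of this update (ours; sigma_h holds the K diagonal entries of Σ_h) could read as follows. Note that no K × K matrix is ever inverted; the only dense inversion is the N × N one of Eq. (26):

// Procedure I, Eqs. (25)-(26): the diagonal Sigma_h costs K divisions to
// invert, and the update is linear in K for fixed N.
void procedureIUpdate(Eigen::VectorXd& S, Eigen::MatrixXd& CovS,
                      const Eigen::MatrixXd& J,        // J(h, S), K x N
                      const Eigen::VectorXd& sigma_h,  // diag(Sigma_h), size K
                      const Eigen::VectorXd& hSZ) {    // h(S, Z), size K
  const Eigen::Index N = S.size();
  const Eigen::MatrixXd I = Eigen::MatrixXd::Identity(N, N);
  const Eigen::MatrixXd G0 =
      CovS * J.transpose() * sigma_h.cwiseInverse().asDiagonal();  // Eq. (25)
  const Eigen::MatrixXd G1 = G0 * J;
  const Eigen::MatrixXd M  = (I + G1).inverse();
  const Eigen::MatrixXd L  = (G1 * M - I) * G0;                    // Eq. (26)
  const Eigen::MatrixXd C  = I - (G1 * M - I) * G1;                // I - L J(h,S)
  S += L * hSZ;
  CovS = C * CovS * C.transpose()
       + L * sigma_h.asDiagonal() * L.transpose();                 // Eq. (19)
}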

Table 3
Number of FLOPS involved in Procedure I.

Operation   Cost (FLOPS)                                         Cost with zero-blocks skipping (FLOPS)
G_0         2N^2 K + 8K                                          20NK + 8K
G_1         2N^2 K - KN                                          20NK - 10N
L           5N^3 + 11.5N^2 + 2N^2 K + 2.5N - KN - 8              3N^3 + 144N^2 + 2.5N - 8 + 20NK
C           2N^3 - N^2 + N                                       20NK - 10N
Σ(S^+)      4N^3 - 2N^2 + 2N^2 K + NK                            25N^2 + 11NK + 2N^2 K + NK
S^+(t)      2NK                                                  2NK
Total       c_1(N, K) = 11N^3 + 8.5N^2 + 8N^2 K + NK + 8K + 4.5N - 8    c_2(N, K) = 3N^3 + 169N^2 + 2N^2 K + 94NK - 17.5N + 8K - 8

Fig. 5. Ratio c_0/c_2 (speed-up of Procedure I followed by zero-blocks skipping) for different values of K and N. The speed-up is significant mostly when K ≫ N.

To evaluate the computational speed-up, the ratio c_0(N, K)/c_2(N, K) is evaluated for different values of N and K. Fig. 5 shows the corresponding results. When K is high relative to N, a considerable speed-up is achieved. For the case of K = 500 and N = 88 (corresponding to N_t = 25 LR features), the speed-up is 32.3 times and the computation takes about 165 ms. However, the speed-up tapers off quickly as the ratio K/N decreases. For example, for 50 LR features (N = 163) and K = 200 F2F features the speed-up is only 3.2 times. The efficiency increases cubically when K increases for constant N. The increase is more pronounced when N is small; for example, for N = 6 and K = 200 the speed-up is about 188.36 times.

4.2. Procedure II

Another way to exploit the zero-blocks in the Jacobian of h with respect to S is to consider the effects of these zero-blocks simultaneously while performing the algebraic manipulation of the update equations that is carried out to capitalize on the features independence aspect. The details of such a procedure are provided in Appendix A, and the reduced computational cost is c_3(N, K) = 25N^2 + 2KN^2 + 94NK - 19N + 800K + 17177. For the case of K = 500 and N = 88 (corresponding to N_t = 25 LR features), the speed-up of Procedure II is 39.4, which is slightly higher than the speed-up of c_2. However, for 50 LR features (N = 163) and K = 200 F2F features, the speed-up is 6.86, which is more than twice the speed-up of c_2.

4.3. Partial filtering/update propagation

Instead of skipping the operations involving zero-blocks, this section introduces a totally different approach to capitalize on the partial observation aspect. The F2F features directly affect only the sub-vector S_2 of S. The other parameters are only affected through their covariance with S_2. Therefore, S_2 can be updated first, then the covariance matrix of S can be used to propagate the update to S_1. The flowchart of the proposed filtering method is shown in Fig. 6 and the three steps involved are described in the following.

Fig. 6. Flowchart of the extra filtering with partial update of the velocity followed by an update of the remaining parameters.

Table 4
Number of FLOPS involved in the cost reduction based on partial filtering.

Filtering S_2                       11887 + 818K
Adjusting the magnitude of S_2^+    1140
Propagating the update to S_1       11N^2 + 391N - 500
Total                               c_4 = 11N^2 + 391N + 818K + 12527

Fig. 7. Speed-up of the partial filtering/update propagation method for different values of K and N. The reduced computational cost is hundreds of times smaller than the original cost of directly performing the update.

Fig. 8. For a fixed number of LR features (39 features, or N = 130), and increasing the number of F2F features from 100 to 2000, the plot shows the variation of the cost relative to its value at 100 F2F features. The addition of 1000 F2F features requires only 3.47 times more computation than the addition of 100.

4.3.1. Filtering S_2 using the F2F features
The dimension of the state vector in this case (S_2) is only 10. Therefore, Procedure I presented in Section 4.1 is adopted, as it results in a cost linear in K and quadratic in the dimension of the state vector, which is just 10. This results in a filtered S_2^+ that can be computed with a cost of 11887 + 818K FLOPS.

4.3.2. Adjusting the magnitude of S_2^+
Since the F2F features do not provide an observation of the magnitude of V^w, the F2F filtering should not change this magnitude. Therefore, any offset that might occur to the magnitude of V^w during the update must be eliminated. Let ρ represent the ratio of the magnitudes of V^w and V^{w+}:

ρ = ‖V^w‖ / ‖V^{w+}‖,   (27)

then, after the update, V^{w+} is multiplied by ρ, Σ(V^{w+}) by ρ^2, and Σ(V^{w+}, [q^{w+}, ω^{w+}]) by ρ. The total cost of this operation is 1140 FLOPS.
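A sketch of this scale fix (ours, assuming V occupies three consecutive entries of S_2 at offset vOff): scaling the corresponding covariance rows and columns by ρ yields ρ^2 on the Σ(V) block and ρ on the cross-covariances, as required:

// Eq. (27): rescale the updated velocity to its pre-update magnitude.
void adjustVelocityMagnitude(Eigen::VectorXd& S2, Eigen::MatrixXd& CovS2,
                             const Eigen::Vector3d& Vbefore,
                             int vOff) {  // offset of V within S2 (assumed)
  const double rho = Vbefore.norm() / S2.segment<3>(vOff).norm();
  S2.segment<3>(vOff)       *= rho;
  CovS2.middleRows(vOff, 3) *= rho;   // rho^2 on the V block,
  CovS2.middleCols(vOff, 3) *= rho;   // rho on the cross-covariances
}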

4.3.3. Propagating the update to S_1
After updating S_2, the information acquired from the F2F features is propagated to S_1. This is done using the covariance matrix Σ(S), which acts as a string probabilistically connecting S_2 and S_1 through the covariance Σ(S_1, S_2). The Schur complement and conditional probability identities allow determining S_1^+ and its covariance as follows (see Appendix E in [15]):

W = Σ(S_1, S_2) Σ(S_2)^{-1},
S_1^+ = S_1 + W [S_2^+ - S_2],
Σ(S_1^+, S_2^+) = W Σ(S_2^+), and
Σ(S_1^+) = Σ(S_1) - W [Σ(S_2) - Σ(S_2^+)] W'.   (28)

The cost of this update can be evaluated as 11N^2 + 391N - 500 FLOPS, given that Σ(S_1, S_2) is an (N - 10) × 10 matrix.
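A direct transcription of Eq. (28), under the same assumptions as the earlier sketches, reads:

// Eq. (28): spread the filtered S2 (mean and covariance) to S1 through
// the cross-covariance; the only inversion is the 10 x 10 one of Sigma(S2).
void propagateUpdate(Eigen::VectorXd& S1, Eigen::MatrixXd& CovS1,
                     Eigen::MatrixXd& CrossS1S2,       // Sigma(S1, S2)
                     const Eigen::VectorXd& S2,  const Eigen::MatrixXd& CovS2,
                     const Eigen::VectorXd& S2p, const Eigen::MatrixXd& CovS2p) {
  const Eigen::MatrixXd W = CrossS1S2 * CovS2.inverse();
  S1        += W * (S2p - S2);
  CovS1     -= W * (CovS2 - CovS2p) * W.transpose();
  CrossS1S2  = W * CovS2p;                             // Sigma(S1+, S2+)
}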

The costs of all the operations involved in the partial filtering/update propagation approach are shown in Table 4. The total cost is linear in K and quadratic in N. The speed-up with respect to c_0 is shown in Fig. 7. This figure shows that the speed-up is dramatically larger than for the methods presented in the previous sections. For the case (K = 500, N = 88) the speed-up is 909.43 times. For the case of 200 F2F features and 50 LR features (K = 200, N = 163), the speed-up is 187.44 times and the time required is about 7 ms. Another important property of this method is that, starting from a moderate value of N (for example 130, which is equivalent to 39 features), the main computational cost is due to the N terms. Hence, K can be increased significantly while incurring a low increase in the total cost. Fig. 8 shows the increase of the cost for N fixed to 130 while increasing K from 100 to 2000. The cost for including K = 1000 F2F features is only 3.47 times larger than the cost needed for 100.

5. Experimental study

The purpose of this section is threefold. Firstly, it aims to show that F2F features can improve SFM estimation in real-world situations while taking care of outliers and the like. Secondly, it illustrates how easy and straightforward it is to augment an already existing filter with an implementation of the proposed approach. And thirdly, it confirms the validity of the cost reduction results presented in Section 4.3. Towards this, two sets of experiments are performed. First, a set of simulation experiments is conducted to validate that the proposed partial filtering approach does work and that its cost is linear. Then, the proposed approach is integrated within the SceneLib1.0 software, which is an implementation of Davison et al.'s monocular SLAM system [2]. The improvement achieved is assessed on real data with camera ground truth and via comparison against the results of KinectFusion [9] as implemented in PCL [23].

Fig. 9. Translation direction error in degrees. The first frames before convergence are truncated. The introduction of the F2F features reduces the error by a factor of 2.


5.1. Simulation results

This subsection provides simulation results that compare the results of a straightforward implementation of an EKF SFM filter with the results that can be obtained by adding to the same filter the F2F extra update step performed as in Section 4.3. The simulation data was created as follows. 50 3D points are generated randomly within a cube of size 4 m^3 centered at (0, 0, 5 m) (the units are not really relevant, but "meters" are used to better illustrate the extent of the errors in the results). Then a sequence of random motions is applied to a virtual camera in such a way that it is always fixating at the centroid of the cloud. For every motion in the sequence, zero-mean Gaussian noise with different variances is added to the projections of the 3D points on the corresponding camera frame to obtain the LR measurements. Also, at every frame another random cloud of K points with the same dimensions is generated, and its projections on the last two frames (augmented with noise) are used to simulate the F2F features.

The results presented below correspond to the average of 50 runs with different levels of noise and different LR and F2F data. On an Intel Pentium M 2.13 GHz with a C++ implementation, the average time required to perform the update given 200 F2F features is about 8 ms. For 900 features the average time is about 32 ms, and for 1700 features about 46 ms. This is in line with the theoretical linear cost discussed earlier.

Fig. 10. Translational velocity error in degrees. The velocity error is significantly reduced when using the F2F features.

Fig. 11. Rotational velocity error (RMS, ×10^{-5}). Significant reduction of the error with the use of F2F features.

5.1.1. 3D parameter errors
The error in the translation and translational velocity is determined as the angle in degrees between the true and estimated directions. The error in rotation is taken as ‖R R̂' - I(3)‖, where R̂ is the estimated rotation and R is the true rotation. This error combines both the error in the axis of rotation and the error in the magnitude of rotation. In the case where the two compared rotations are about the same axis, this error represents the difference in radians between the two rotations; that is why radians are used as the unit for this error. The error in rotational velocity and in the 3D feature positions is taken as the Root Mean Square (RMS) error between the true and estimated vectors. We also look at the re-projection errors of the last estimate of the 3D features in all the previous images. The mean values of those errors for the 50 runs are shown in Figs. 9–14 respectively.

Both the rotation and translation errors are reduced by almost half when using the F2F features. For the translational and rotational velocities, not only are the errors in general smaller, but the error profiles are smoother and exhibit fewer spikes. This is due to the fact that with LR features only, the system infers the values of the velocities from the differences in translation and rotation estimates between successive states. With F2F features, the system uses actual incremental measurements. Having better velocity estimates is very important for many applications, such as autonomous vehicles. The depth error as well is significantly reduced. The improvement in the accuracy of the 3D parameters is reflected in a reduction of the re-projection error, as shown in Fig. 14.

Fig. 12. Rotational error in radians. The introduction of the F2F features reduces the error almost by half.

Fig. 13. Depth error (meters). The F2F features help in obtaining a more accurate depth.

Fig. 14. Mean re-projection error of the final estimate of the 3D points on all the previous frames (pixels). The improvement in the 3D parameter estimates is reflected in a smaller re-projection error.


5.2. Experiments on real data

In this section, the F2F update procedure as described in Section 4.3 is used to augment Davison et al.'s SceneLib1.0 MonoSLAM system [2]. The F2F filtering was coded in a separate package with a single interface that can be used as follows in any filter that maintains a state vector "V" and its covariance "Cov" (assumed to be, or be convertible to, float pointers):

F2FFilter* f2ff = new F2FFilter(Cx, Cy, fx, fy, width, height, max_num_f2ff);
f2ff->filter(V, Cov, image);

The max_num_f2ff parameter is the maximum number of F2Ffeatures that the system should attempt to detect and matchbetween every two consecutive frames. Note that the number ofactual features used at each frame is always lower than thisbecause many of the features are pruned as outliers.
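For concreteness, a hypothetical sketch of how this call slots into a host filter loop is shown below; grabFrame, mainFilter and its accessors are stand-ins for whatever the host implementation provides, not part of the actual SceneLib1.0 API:

F2FFilter* f2ff = new F2FFilter(Cx, Cy, fx, fy, width, height, max_num_f2ff);
while (Image* image = grabFrame()) {    // hypothetical frame source
  mainFilter->predictAndUpdate(image);  // LR filtering, left untouched
  // extra F2F step of Section 4.3, applied between main-filter iterations
  f2ff->filter(mainFilter->stateVector(), mainFilter->covariance(), image);
}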

To use this F2F interface to add F2F filtering capabilities to SceneLib1.0, a little additional coding was done in order to add an extra button to the graphical interface to toggle the F2F filtering ON and OFF, as shown in Fig. 15.

In a first set of experiments, test sequences from the RGB-D SLAM Dataset of the CVPR group at TUM [31] are used to evaluate the improvement that the proposed approach can bring to the SceneLib1.0 system. This dataset consists of sequences of RGB and depth images captured with a handheld Kinect sensor. The ground-truth trajectory of the sensor was obtained from a high-accuracy motion capture system with eight high-speed tracking cameras (100 Hz). Static and indoor sequences from this dataset were selected. Fig. 15 shows a frame from one of the used sequences inside the graphical interface of the modified SceneLib1.0 software.

The F2F features are detected with the FAST feature detector [22] and tracked using the OpenCV pyramidal implementation of the KLT tracker [1]. Davison et al.'s MonoSLAM system requires 4 known world points to be used as anchor points to fix the world reference frame. For each experiment performed, 4 anchor points were selected from the sequence used in that experiment. The depths of those points were determined from the Kinect depth view of the first image and then projected to the world coordinate frame using the ground truth of the first camera pose. Davison et al.'s system uses active feature matching and attempts to keep at least 12 LR features visible per frame. Whenever the number of visible features drops below 10, new features are automatically initialized using the Particle Filtering approach presented in [2]. The total number of LR features was 20 on average.

In each experiment, the original and the F2F-augmented filters were run on the corresponding sequence until the anchor points disappear from the field of view. Each experiment spanned between 150 and 300 frames. On the mentioned computer system, for 100 F2F features, the feature detection and matching takes about 10 ms. Out of the 100 features, the average number of selected inliers was around 70. The computation of the Jacobian matrices for the average case takes about 4 ms and the F2F filtering about 3 ms. Therefore, the total cost incurred by the F2F augmentation is less than 20 ms.

The performance of the F2F augmentation exhibited noticeable variation. Table 5 shows the average and standard deviation of the rotation and translation errors with respect to the ground truth. The rotation errors are computed as in Section 5.1.1. Figs. 16 and 17 show the translation and rotation errors for a typical experiment. Fig. 18 shows the ground-truth and estimated trajectories.

Fig. 15. SceneLib1.0 interface with the added button to toggle F2F filtering ON and OFF. A frame from the considered dataset is shown during the estimation. The projections of the tracked LR features are displayed with their estimated covariance on the image. The estimated camera motion and the 3D positions of the LR features are displayed in the world coordinate frame.

Table 5
Average errors of 10 experiments on sequences from the RGB-D SLAM Dataset.

                           LR      LR + F2F
Translation error (cm)     19.7    16.3
Translation error std      5.4     4.2
Rotation error (radians)   0.11    0.089
Rotation error std         0.039   0.032

Fig. 16. Translational errors in meters in a typical experiment on a sequence from the considered dataset.

Fig. 17. Rotation errors in radians for the camera orientation in a typical experiment on a sequence from the considered dataset.

The following observations were noted. In almost all the sequences, the filtering with F2F features led to more accurate results than the original system. In rare cases, adding the F2F features resulted in slightly lower accuracy. This happens mostly in scenes where a significant portion of the frames in the sequence exhibits the two-frame bas-relief ambiguity [16]. A way to cope with this is to randomly divide the F2F data at each frame into multiple sets and perform the partial filtering step of Section 4.3.1 on each set individually. If the change in the camera velocity across the different sets is above a certain threshold, the F2F filtering at the concerned frame is dropped.

Another important observation is related to the number of F2F features employed. In general, a trend of increased accuracy with an increased number of F2F features was observed. However, this trend was not very consistent. One explanation for this is that the KLT implementation employed ranks the matched features in order of best matching score. Therefore, for a given frame, the first K F2F features returned by this matcher have a higher likelihood of being inliers and are "higher quality" matches than any further features obtained by this matcher.

5.2.1. Comparison using KinectFusion
In another set of experiments, KinectFusion [9] was used to track the motion of a Kinect camera and reconstruct the scene from a set of RGB-D images. The resulting camera motions were used as a pseudo-ground truth to further assess the effect of F2F filtering. Two sequences of 2100 frames each were reconstructed using KinectFusion. Fig. 19 shows a snapshot of the reconstructed scene corresponding to the first sequence.

Fig. 18. Estimated and ground-truth camera trajectories in a typical experiment on a sequence from the considered dataset.

Fig. 19. Snapshot of the reconstructed scene from sequence 1 using KinectFusion.

Fig. 20. Translation errors in meters for the first sequence with respect to the KinectFusion estimates.

Fig. 21. Rotation errors in radians for the camera orientation for the first sequence with respect to the KinectFusion estimates.

Fig. 22. Translation errors in meters for the second sequence with respect to the KinectFusion estimates.

Fig. 23. Rotation errors in radians for the camera orientation for the second sequence with respect to the KinectFusion estimates.


Figs. 20–23 show the errors of the MonoSLAM estimation, in both LR-only and LR + F2F modes, versus the estimates of KinectFusion for the two considered sequences. In the first sequence, the camera repeatedly moves away from the anchor points (known features) used in MonoSLAM and then comes back, which explains the zigzag pattern in the error. In the second sequence, the camera sees the anchor points only at the beginning of the sequence and then moves away from them, which explains the increasing error pattern. An important thing to notice in those plots is that, whenever the LR error increases, the F2F-augmented filter manages to keep a significantly lower error. Similarly, in the second sequence, error accumulates at a much slower rate with the F2F filtering. When the LR filter error is low, the estimates with and without F2F filtering are similar. This is a natural and expected behavior because in this case the F2F features do not carry a lot of extra information about the motion over what is obtained from the LR features.

The bottom-line observation is that the increase in accuracy obtained, combined with the low computational cost and the ease of integration with different implementations, makes the use of the proposed approach well worth it in any analytic filter.

6. Conclusions and future work

This paper presented an approach to efficiently incorporate the information from F2F features in analytical SFM filters. The approach can be applied to any filter as long as it maintains a mean vector and a covariance matrix of the estimates. Furthermore, the filtering is accomplished through an extra filtering step that requires virtually no coding change in the filter to which the F2F information is added. It is computationally economical, with the ability to accommodate hundreds of F2F features in real time. The presented experimental results showed a significant increase in the accuracy of the structure and motion estimates.

The presented approach, although designed for filter-based SFM, can also be used to augment non-filter-based approaches such as the Parallel Tracking and Mapping (PTAM) approach [12], since the latter performs a non-linear optimization and maintains a Gaussian distribution of the estimates.

The presented solution also extends naturally to types of F2F measurements other than point-wise features. For example, curves and edges can be used, as long as one can formulate a measurement equation relating those F2F measurements to the camera velocity. Determining which types of features are suitable for each situation is a subject of future work.

Another extension of this work is to consider features spanning three frames instead of F2F features. In this case, an algebraic three-frame constraint, such as the trifocal constraint, would be used. A main advantage of using three frames is that the relative translation magnitude would be observable.

Appendix A. Reduction based on simultaneously considering the partial observation and the features' independence

Going back to the expression of $L$ as determined in (24), let $G_3$ and $G_4$ represent the two matrices:

\[
G_3 = [R_h(Z_{f2f})]^{-1} J(h, S), \quad \text{and} \quad
G_4 = R(S)\, J(h, S)' = [R(S_1, S_2);\; R(S_2)]\, J(h, S_2)'. \tag{A.1}
\]

$G_3$ has both the same structure and dimension as $J(h, S)$. Therefore, only its non-zero block, which is equal to $[R_h(Z_{f2f})]^{-1} J(h, S_2)$, needs to be computed. We refer to this part as $G_2$. The computation of $G_2$ requires only $10K$ FLOPS since $[R_h(Z_{f2f})]^{-1}$ is $K \times K$ diagonal. The computation of $G_4$ can be carried out in $19NK$ FLOPS.
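As a small illustration of why the diagonal structure matters, the product $[R_h(Z_{f2f})]^{-1} J(h, S_2)$ reduces to a row scaling. A minimal sketch with illustrative names and dimensions (not from the paper's implementation):

```python
import numpy as np

K = 200                              # number of F2F features (illustrative)
r_diag = np.random.rand(K) + 0.1     # diagonal entries of R_h(Z_f2f)
J_h_S2 = np.random.randn(K, 10)      # the K x 10 non-zero Jacobian block

# Since R_h(Z_f2f) is diagonal, inverting it is an elementwise reciprocal,
# and G2 = R_h^{-1} J(h, S2) is a row scaling: 10K multiplications in total,
# instead of a dense K x K inversion followed by a dense matrix product.
G2 = J_h_S2 / r_diag[:, None]
```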

With the above notation, the term $\left(I(N) + R(S)\,J(h,S)'\,[R_h(Z_{f2f})]^{-1} J(h,S)\right)^{-1}$ in the expression of $L$ can be written as:

\[
\left(I(N) + R(S)\,J(h,S)'\,[R_h(Z_{f2f})]^{-1} J(h,S)\right)^{-1}
= \left(I(N) + [R(S_1,S_2);\; R(S_2)]\,J(h,S_2)'\,(G_3)\right)^{-1}. \tag{A.2}
\]

The idea here is to manipulate the equations so that costly operations such as inversions and matrix multiplications are performed mostly on $10 \times 10$ matrices and diagonal matrices. Using another variant of the Sherman–Morrison–Woodbury formula, introduced in [6], which expresses the inverse of a matrix of the form $(A + UBV)$ as:

\[
(A + UBV)^{-1} = A^{-1} - A^{-1} U \left(I + B V A^{-1} U\right)^{-1} B V A^{-1}, \tag{A.3}
\]

(A.2) becomes

\[
\begin{aligned}
\big(I(N) &+ [R(S_1,S_2);\; R(S_2)]\,J(h,S_2)'\,(G_3)\big)^{-1} \\
&= [I(N)]^{-1} - [I(N)]^{-1}[R(S_1,S_2);\; R(S_2)] \left(I(10) + J(h,S_2)'(G_3)[I(N)]^{-1}[R(S_1,S_2);\; R(S_2)]\right)^{-1} J(h,S_2)'(G_3)[I(N)]^{-1} \\
&= I(N) - [R(S_1,S_2);\; R(S_2)] \left(I(10) + J(h,S_2)'(G_3)[R(S_1,S_2);\; R(S_2)]\right)^{-1} J(h,S_2)'(G_3).
\end{aligned} \tag{A.4}
\]
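Both (A.3) and the special case used in (A.4) (with $A = I(N)$ and $B = I(10)$) are easy to verify numerically. The sketch below uses random matrices with illustrative dimensions; `U` and `V` stand in for $[R(S_1,S_2);\; R(S_2)]$ and $J(h,S_2)'(G_3)$, and the point of the second check is that a full-state-size inverse is traded for a $10 \times 10$ one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generic check of (A.3): (A + UBV)^{-1} = A^{-1} - A^{-1} U (I + B V A^{-1} U)^{-1} B V A^{-1}
n, k = 8, 3
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # well-conditioned A
U = rng.standard_normal((n, k))
B = rng.standard_normal((k, k))
V = rng.standard_normal((k, n))
A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ B @ V)
rhs = A_inv - A_inv @ U @ np.linalg.inv(np.eye(k) + B @ V @ A_inv @ U) @ B @ V @ A_inv
assert np.allclose(lhs, rhs)

# Special case of (A.4): A = I(N), B = I(10), so only a 10 x 10 inverse is needed
D = 25                                # stand-in for the full state dimension
U2 = rng.standard_normal((D, 10))     # plays the role of [R(S1,S2); R(S2)]
V2 = rng.standard_normal((10, D))     # plays the role of J(h,S2)'(G3)
lhs2 = np.linalg.inv(np.eye(D) + U2 @ V2)
rhs2 = np.eye(D) - U2 @ np.linalg.inv(np.eye(10) + V2 @ U2) @ V2
assert np.allclose(lhs2, rhs2)
```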

Recalling that $G_3$ has the same form as $J(h,S)'$ with its non-zero block equal to $[R_h(Z_{f2f})]^{-1} J(h,S_2)$, the matrix $G_3 [R(S_1,S_2);\; R(S_2)]$, referred to subsequently as $G_5$, can be written in the form:

\[
G_5 = G_3 [R(S_1,S_2);\; R(S_2)] = [R_h(Z_{f2f})]^{-1} J(h,S_2)\, R(S_2) = G_2 R(S_2). \tag{A.5}
\]

$G_5$ is a $K \times 10$ matrix whose computation using $G_2$ requires $190K$ FLOPS. Substituting $G_5$ by its value in (A.4), and using $G_6$ to represent the matrix $J(h,S_2)'(G_5)$, which is a $10 \times 10$ matrix that can be computed in $200K + 100$ FLOPS, (A.4) becomes:

\[
\left(I(N) + [R(S_1,S_2);\; R(S_2)]\,J(h,S_2)'(G_3)\right)^{-1}
= I(N) - [R(S_1,S_2);\; R(S_2)] \left(I(10) + G_6\right)^{-1} J(h,S_2)'(G_3). \tag{A.6}
\]

Now $L$ can be written as:

\[
\begin{aligned}
L &= -G_4 \Big( [R_h(Z_{f2f})]^{-1} - G_3 \big( I(N) - [R(S_1,S_2);\; R(S_2)] (I(10) + G_6)^{-1} J(h,S_2)'(G_3) \big) (G_4) [R_h(Z_{f2f})]^{-1} \Big) \\
&= -G_4 \Big( [R_h(Z_{f2f})]^{-1} - \big( G_3 - G_3 [R(S_1,S_2);\; R(S_2)] (I(10) + G_6)^{-1} J(h,S_2)'(G_3) \big) (G_4) [R_h(Z_{f2f})]^{-1} \Big) \\
&= -G_4 \Big( [R_h(Z_{f2f})]^{-1} - \big( G_3 - G_5 (I(10) + G_6)^{-1} J(h,S_2)'(G_3) \big) (G_4) [R_h(Z_{f2f})]^{-1} \Big).
\end{aligned} \tag{A.7}
\]

$J(h,S_2)'(G_3)$ can be expressed as:

\[
J(h,S_2)'(G_3) = \big[\, 0_{10 \times (3N+3)} \;\; J(h,S_2)'(G_2) \,\big] = \big[\, 0_{10 \times (3N+3)} \;\; G_7 \,\big], \tag{A.8}
\]

with $G_7 = J(h,S_2)'(G_2)$, which is a $10 \times 10$ matrix, and $0_{10 \times (3N+3)}$ a zero matrix of size $10 \times (3N+3)$. $G_7$ can be computed in $200K + 100$ FLOPS. Therefore, $G_5 (I(10) + G_6)^{-1} J(h,S_2)'(G_3)$ can be written as:


14 A. Fakih et al. / Computer Vision and Image Understanding 129 (2014) 1–14

\[
G_5 (I(10) + G_6)^{-1} J(h,S_2)'(G_3) = \big[\, 0_{10 \times (3N+3)} \;\; G_5 (I(10) + G_6)^{-1} (G_7) \,\big], \tag{A.9}
\]

and since $G_3 = \big[\, 0_{10 \times (3N+3)} \;\; G_2 \,\big]$, then

\[
L = -G_4 \Big( [R_h(Z_{f2f})]^{-1} - \big[\, 0_{10 \times (3N+3)} \;\; G_2 - G_5 (I(10) + G_6)^{-1} (G_7) \,\big] (G_4) [R_h(Z_{f2f})]^{-1} \Big), \tag{A.10}
\]

and we have:

\[
\begin{aligned}
\big[\, 0_{10 \times (3N+3)} \;\; G_2 - G_5 (I(10) + G_6)^{-1} (G_7) \,\big] (G_4)
&= \big[\, 0_{10 \times (3N+3)} \;\; G_2 - G_5 (I(10) + G_6)^{-1} (G_7) \,\big] [R(S_1,S_2);\; R(S_2)]\, J(h,S_2)' \\
&= \big[ G_2 - G_5 (I(10) + G_6)^{-1} (G_7) \big] R(S_2)\, J(h,S_2)',
\end{aligned} \tag{A.11}
\]

hence, $L$ can be re-written as:

\[
L = -G_4 \Big( [R_h(Z_{f2f})]^{-1} - \big[ G_2 - G_5 (I(10) + G_6)^{-1} (G_7) \big] R(S_2)\, J(h,S_2)' [R_h(Z_{f2f})]^{-1} \Big). \tag{A.12}
\]

Note that $R(S_2)\, J(h,S_2)' [R_h(Z_{f2f})]^{-1}$ is equal to $(G_5)'$, and $L$ can finally be expressed as:

\[
L = -G_4 \Big( [R_h(Z_{f2f})]^{-1} - \big[ G_2 - G_5 (I(10) + G_6)^{-1} (G_7) \big] (G_5)' \Big). \tag{A.13}
\]

The sequence of operations to compute $L$ is summarized as follows:

\[
\begin{aligned}
G_2 &= [R_h(Z_{f2f})]^{-1} J(h,S_2); \\
G_4 &= [R(S_1,S_2);\; R(S_2)]\, J(h,S_2)'; \\
G_5 &= G_2 R(S_2); \\
G_6 &= J(h,S_2)'(G_5); \\
G_7 &= J(h,S_2)'(G_2); \quad \text{and} \\
L &= -G_4 \Big( [R_h(Z_{f2f})]^{-1} - \big[ G_2 - G_5 (I(10) + G_6)^{-1} (G_7) \big] (G_5)' \Big).
\end{aligned}
\]
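To illustrate the whole sequence, the sketch below implements these six operations with random matrices and checks the result against a direct, unreduced evaluation of $L$. The block conventions (the non-zero Jacobian block occupying the last ten state dimensions) follow (A.8); all dimensions and names are illustrative rather than the paper's code:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 5, 12                          # LR and F2F feature counts (illustrative)
D = 3 * N + 13                        # stand-in for the full state dimension

r_diag = rng.random(K) + 0.5          # diagonal of R_h(Z_f2f)
J2 = rng.standard_normal((K, 10))     # J(h, S2), the non-zero Jacobian block
P = rng.standard_normal((D, D))
R_S = P @ P.T                         # state covariance R(S), symmetric
R_cols = R_S[:, -10:]                 # [R(S1,S2); R(S2)]: last ten columns of R(S)
R_S2 = R_S[-10:, -10:]                # the 10 x 10 block R(S2)

# The sequence of operations from Appendix A
G2 = J2 / r_diag[:, None]             # row scaling, since R_h is diagonal
G4 = R_cols @ J2.T
G5 = G2 @ R_S2
G6 = J2.T @ G5
G7 = J2.T @ G2
core = G2 - G5 @ np.linalg.inv(np.eye(10) + G6) @ G7
L = -G4 @ (np.diag(1.0 / r_diag) - core @ G5.T)

# Direct evaluation without the reduction, for comparison
J = np.hstack([np.zeros((K, D - 10)), J2])          # J(h, S)
G3 = J / r_diag[:, None]
Rh_inv = np.diag(1.0 / r_diag)
M = np.linalg.inv(np.eye(D) + G4 @ G3)
L_direct = -G4 @ (Rh_inv - G3 @ M @ G4 @ Rh_inv)
assert np.allclose(L, L_direct)
```

Note how the reduced path inverts only a $10 \times 10$ matrix, while the direct path inverts a matrix of the full state dimension.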

The computation of $L$ requires $41NK - 10N + 200K + 16977$ FLOPS. Table A.6 shows the computational cost of all the operations involved in the update.

Table A.6
Number of FLOPS involved in the cost reduction based on Procedure II.

Operation      Cost (FLOPS)
$G_2$          $10K$
$G_4$          $19NK$
$G_5$          $190K$
$G_6$          $200K + 100$
$G_7$          $200K + 100$
$L$            $41NK - 10N + 200K + 16977$
$L\,J(h,S)$    $20NK - 9N$
$+R(S)$        $25N^2 + 12NK + 2N^2K$
$+S(t)$        $2NK$
Total          $c_3(N,K) = 25N^2 + 2KN^2 + 94NK - 19N + 800K + 17177$
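To get a feel for these counts, the total $c_3(N,K)$ from Table A.6 can be evaluated for representative feature counts. A small helper, purely illustrative:

```python
def c3(N, K):
    # Total FLOP count of the update, as given in Table A.6
    return 25*N**2 + 2*K*N**2 + 94*N*K - 19*N + 800*K + 17177

# e.g., 20 LR features and 300 F2F features
print(c3(20, 300))   # roughly 1.1 million FLOPS, modest for per-frame use
```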

References

[1] J.Y. Bouguet, Pyramidal implementation of the Lucas Kanade feature tracker: description of the algorithm, Intel Corporation, Microprocessor Res. Labs 5 (2001).

[2] A. Davison, I. Reid, N. Molton, O. Stasse, MonoSLAM: real-time single camera SLAM, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 1052–1067.

[3] E. Eade, T. Drummond, Scalable monocular SLAM, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2006, pp. 469–476.

[4] E. Eade, T. Drummond, Monocular SLAM as a graph of coalesced observations, in: IEEE 11th International Conference on Computer Vision, 2007, pp. 1–8.

[5] M.A. Fischler, R. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (1981) 381–395.

[6] H. Henderson, S. Searle, On deriving the inverse of a sum of matrices, SIAM Rev. 23 (1981) 53–60.

[7] N. Higham, Accuracy and Stability of Numerical Algorithms, second ed., Society for Industrial and Applied Mathematics, 2002.

[8] S. Holmes, G. Klein, D. Murray, A square root unscented Kalman filter for visual monoSLAM, in: IEEE International Conference on Robotics and Automation (ICRA), 2008, pp. 3710–3716.

[9] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, A. Fitzgibbon, KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera, in: Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, UIST '11, ACM, New York, NY, USA, 2011, pp. 559–568.

[10] A. Jepson, D.D. Heeger, Linear subspace methods for recovering translational direction, Spatial Vision Hum. Robots (1992) 39–62.

[11] K. Kanatani, Y. Sugaya, Unified computation of strict maximum likelihood for geometric fitting, J. Math. Imaging Vision 38 (2010) 1–13.

[12] G. Klein, D. Murray, Parallel tracking and mapping for small AR workspaces, in: 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, 2007, pp. 225–234.

[13] G. Klein, D. Murray, Improving the agility of keyframe-based SLAM, in: Computer Vision – ECCV 2008, Springer, 2008, pp. 802–815.

[14] N.M. Kwok, G. Dissanayake, Bearing-only SLAM in indoor environments using a modified particle filter, in: Australasian Conference on Robotics and Automation, 2003, pp. 1–8.

[15] T. Lefebvre, H. Bruyninckx, J. De Schutter, Kalman filtering for non-minimal measurement models, in: Nonlinear Kalman Filtering for Force-Controlled Robot Tasks, vol. 19, Springer-Verlag, 2005, pp. 223–226.

[16] S.J. Maybank, O.D. Faugeras, A theory of self-calibration of a moving camera, Int. J. Comput. Vision (IJCV) 8 (1992) 123–151.

[17] R. Newcombe, A. Davison, Live dense reconstruction with a single moving camera, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1498–1505.

[18] R.A. Newcombe, S. Lovegrove, A.J. Davison, DTAM: dense tracking and mapping in real-time, in: International Conference on Computer Vision (ICCV), vol. 1, 2011, pp. 2320–2327.

[19] J. Oliensis, Exact two-image structure from motion, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 1618–1633.

[20] M. Pupilli, A. Calway, Real-time camera tracking using a particle filter, in: British Machine Vision Conference (BMVC05), 2005, pp. 519–528.

[21] G. Qian, R. Chellappa, Structure from motion using sequential Monte Carlo methods, Int. J. Comput. Vision (IJCV) 59 (2004) 5–31.

[22] E. Rosten, T. Drummond, Machine learning for high-speed corner detection, in: Computer Vision – ECCV 2006, Springer, 2006, pp. 430–443.

[23] R.B. Rusu, S. Cousins, 3D is here: Point Cloud Library (PCL), in: IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 2011.

[24] R. Sim, P. Elinas, J.J. Little, A study of the Rao-Blackwellised particle filter for efficient and accurate vision-based SLAM, Int. J. Comput. Vision 74 (2007) 303–318.

[25] S. Soatto, R. Frezza, P. Perona, Motion estimation via dynamic vision, IEEE Trans. Autom. Control 41 (1996) 393–413.

[26] S. Soatto, P. Perona, Recursive 3-D visual motion estimation using subspace constraints, Int. J. Comput. Vision (IJCV) 22 (1997) 235–259.

[27] S. Soatto, P. Perona, R. Frezza, G. Picci, Recursive motion and structure estimation with complete error characterization, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1993, pp. 428–433.

[28] R. Steffen, A Robust Iterative Kalman Filter Based on Implicit Measurement Equations, Technical Report 8, University of Bonn, Department of Photogrammetry, Institute of Geodesy and Geoinformation, 2008.

[29] H. Strasdat, J. Montiel, A.J. Davison, Real-time monocular SLAM: why filter?, in: IEEE International Conference on Robotics and Automation (ICRA), 2010, pp. 2657–2664.

[30] H. Strasdat, J.M.M. Montiel, A. Davison, Scale drift-aware large scale monocular SLAM, Proc. Robot.: Sci. Syst. 1 (2010) 4.

[31] J. Sturm, N. Engelhard, F. Endres, W. Burgard, D. Cremers, A benchmark for the evaluation of RGB-D SLAM systems, in: IEEE International Conference on Intelligent Robots and Systems (IROS), 2012, pp. 573–580.

[32] P. Torr, D. Murray, The development and comparison of robust methods for estimating the fundamental matrix, Int. J. Comput. Vision (IJCV) 24 (1997) 271–300.

[33] Z. Zhang, Y. Shan, Incremental motion estimation through modified bundle adjustment, in: International Conference on Image Processing (ICIP), vol. 3, 2003, pp. II-343–346.

