IEEE SIGNAL PROCESSING LETTERS, VOL. 22, NO. 9, SEPTEMBER 2015

Visual Tracking via Locally Structured Gaussian Process Regression

    Yao Sui and Li Zhang

Abstract—We propose a new target representation method, where the temporally obtained targets are jointly represented as a time series function by exploiting their spatially local structure. With this method, we propose a new tracking algorithm, where tracking is formulated as a problem of Gaussian process regression over the joint representation. Numerous experiments on various challenging video sequences demonstrate that our tracker outperforms several other state-of-the-art trackers.

Index Terms—Gaussian process regression, sparsity regularization, target representation, visual tracking.

    I. INTRODUCTION

TARGET representation is critical to visual tracking. A good representation describes the target robustly and captures appearance changes accurately, leading to good tracking performance. Over the past decades, much success has been achieved in developing target representations [1]–[3]. However, many difficulties, e.g., illumination changes of the scenarios, occlusions, and cluttered background, are still obstacles to the development of target representation.

Subspace and sparse representations are the two most popular representation methods in tracking. Subspace representation [4]–[6] assumes that the temporally obtained targets reside in a low-dimensional subspace, and the principal component analysis method is used to evaluate the representation. Subspace representation has been demonstrated to be effective in some challenging situations, e.g., illumination changes. This is attributed to the assumption that the representation errors are small and dense, following an independent and identically distributed (i.i.d.) Gaussian with small variance. However, it is unstable in the case of occlusions, because the representation errors may then be arbitrarily large.

Sparse representation [7]–[13] describes the target as a linear combination of a few online-maintained templates. Partial occlusions are absorbed by the sparse representation errors. The underlying assumption of this paradigm is that the representation errors obey an i.i.d. Laplace or other heavy-tailed distribution. This allows the errors to be arbitrarily large but sparse. Thus, sparse representation is intrinsically robust against partial occlusions. However, it is computationally expensive, and there are extensive studies on speeding up this paradigm, e.g., fast algorithms [14], multi-task learning [15], and compressive sensing [16].

Motivated by this previous success, we propose an alternative approach to improve the tracking performance. We develop a new sparsity-based representation method, which jointly represents the temporally obtained targets as a time series function by exploiting the spatially local structure among the obtained targets. We also give our representation a geometric explanation. With this representation, we model the temporally obtained targets by a locally structured Gaussian process. Thus, visual tracking is formulated as a problem of Gaussian process regression over the proposed representation. Numerous experiments on various challenging video sequences demonstrate that our tracker outperforms several other state-of-the-art trackers.

Manuscript received September 17, 2014; revised November 27, 2014; accepted February 06, 2015. Date of publication February 10, 2015; date of current version February 19, 2015. This work was supported by the National Natural Science Foundation of China under Grants 61172125 and 61132007. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Qian Du. The authors are with the Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/LSP.2015.2402313. 1070-9908 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. LEARNING REPRESENTATION

Given a vector $\mathbf{y}$ and a matrix $\mathbf{X}$, from the perspective of linear approximation

$$\mathbf{y} \approx \mathbf{X}\mathbf{c}, \qquad (1)$$

the coefficients $\mathbf{c}$ imply the contributions of $\mathbf{X}$'s columns to the reconstruction of $\mathbf{y}$. Thus, $\mathbf{c}$ can be used to indicate the relations between $\mathbf{y}$ and each column of $\mathbf{X}$. To localize these relations, we constrain $\mathbf{c}$ to be sparse, i.e., we use only a few of the most related columns of $\mathbf{X}$ to represent $\mathbf{y}$. To make this approximation robust against possibly abnormal values in $\mathbf{y}$, we introduce an additive error term $\mathbf{e}$ that is also enforced to be sparse, i.e., we allow the errors to be arbitrarily large but sparse. To this end, we define the following function to represent $\mathbf{y}$ over $\mathbf{X}$.

Definition 1 (representation): Given a column vector $\mathbf{y}$ and a set of column vectors, denoted by the matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$, the representation of $\mathbf{y}$ over $\mathbf{X}$ is

$$g(\mathbf{y}; \mathbf{X}) = \max\big(|\mathbf{c}|\big), \qquad (2)$$

where $|\cdot|$ evaluates the element-wise absolute values of a vector, and $\mathbf{c}$ is obtained from

$$\min_{\mathbf{c},\, \mathbf{e}} \; \Big\|\mathbf{y} - \sum\nolimits_i c_i \mathbf{x}_i - \mathbf{e}\Big\|_2^2 + \lambda \big(\|\mathbf{c}\|_1 + \|\mathbf{e}\|_1\big), \qquad (3)$$

where $\mathbf{e}$ denotes the representation errors, $\mathbf{x}_i$ denotes the $i$-th column of $\mathbf{X}$, and $\lambda$ is the weight parameter.

Note that the sparsity imposed on $\mathbf{c}$ in Eq. (3) means that $c_i$ indicates the locally intimate relationship between $\mathbf{y}$ and $\mathbf{x}_i$. It should also be noted that it does not make sense for $\mathbf{y}$ to be represented by itself when $\mathbf{y}$ equals some column of $\mathbf{X}$.



Thus, we constrain $c_i = 0$ for $\mathbf{x}_i = \mathbf{y}$. Similar to work involving $\ell_1$-minimization, e.g., [17]–[19], we set $\lambda$ empirically in this work. Many algorithms can be used to solve Eq. (3), e.g., OMP [20] and LASSO [21]. To make the definition more understandable, we explain it from a geometric viewpoint as follows.

Proposition 1: The coefficient $c_i$ in Eq. (3) defines a quasi-semi-metric¹ [22] for the distance between $\mathbf{y}$ and $\mathbf{x}_i$,

$$d(\mathbf{y}, \mathbf{x}_i), \qquad (4)$$

which decreases as $|c_i|$ grows.

¹A quasi-semi-metric on a set is a function $d$ that satisfies the following two axioms: 1) positivity: $d(\mathbf{u}, \mathbf{v}) \ge 0$; and 2) positive definiteness: $d(\mathbf{u}, \mathbf{v}) = 0$ if and only if $\mathbf{u} = \mathbf{v}$.

Proposition 2: The function $g$ in Eq. (2) defines the distance between a vector $\mathbf{y}$ and a set of vectors $\mathbf{X}$,

$$d(\mathbf{y}, \mathbf{X}) = \min_i \, d(\mathbf{y}, \mathbf{x}_i). \qquad (5)$$

From the above two propositions, it can be seen that we evaluate the similarity between $\mathbf{y}$ and each column of $\mathbf{X}$ in the sense of the distance defined in Eq. (4), and represent $\mathbf{y}$ over $\mathbf{X}$ using the largest similarity (i.e., the minimum distance).
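Concretely, Eq. (3) can be handed to any off-the-shelf $\ell_1$ solver. The sketch below (Python with NumPy and scikit-learn; the function names are ours, and running a single LASSO pass over the stacked dictionary $[\mathbf{X}, \mathbf{I}]$ is a simplification rather than the authors' implementation) recovers the sparse coefficients and errors of Eq. (3) and evaluates the representation of Eq. (2):

```python
import numpy as np
from sklearn.linear_model import Lasso

def represent(y, X, lam=0.01):
    """Sketch of Eq. (3): sparse coefficients c and sparse errors e
    with y ~ X c + e, via LASSO on the stacked dictionary [X, I]."""
    d, n = X.shape
    D = np.hstack([X, np.eye(d)])            # columns of X, then identity for e
    # scikit-learn's Lasso minimizes (1/(2d)) * ||y - D w||^2 + lam * ||w||_1
    solver = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
    solver.fit(D, y)
    return solver.coef_[:n], solver.coef_[n:]  # split w into (c, e)

def g(y, X, lam=0.01):
    """Sketch of Eq. (2): the representation of y over X, i.e., max(|c|)."""
    c, _ = represent(y, X, lam)
    return np.max(np.abs(c))
```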

III. TRACKING ALGORITHM

Our tracking algorithm is conducted within the particle filter framework [23], [24]. A particle corresponds to an image patch on a frame, which is defined by the motion state variable $\mathbf{s} = (x, y, \sigma)$, where $x$ and $y$ denote the 2D position and $\sigma$ is the scaling coefficient. During tracking, numerous particles are generated on each frame, and their corresponding patches are cropped from the frame image according to the state variables. These patches are normalized to the same size and stacked into column vectors, respectively. We call such a vector a candidate. The candidate with the largest likelihood of belonging to the target is determined as the target on each frame.

As the appearance of the target varies, some previously obtained targets must be maintained to properly capture the appearance changes. We call these maintained targets the target templates. Thus, the previously obtained targets can be jointly represented over the target templates by the time series function

$$F(k) = g(\mathbf{y}_k; \mathbf{T}), \qquad (6)$$

where $\mathbf{y}_k$ denotes the target obtained on the $k$-th frame and $\mathbf{T} = [\mathbf{t}_1, \ldots, \mathbf{t}_n]$ denotes the target templates. Note that, according to Eq. (3), solving the representations of the templates is actually equivalent to solving the following minimization

$$\min_{\mathbf{C},\, \mathbf{E}} \; \|\mathbf{T} - \mathbf{T}\mathbf{C} - \mathbf{E}\|_F^2 + \lambda \big(\|\mathbf{C}\|_1 + \|\mathbf{E}\|_1\big), \quad \operatorname{diag}(\mathbf{C}) = \mathbf{0}, \qquad (7)$$

where $\mathbf{C}$ denotes the coefficients, $\mathbf{E}$ denotes the representation errors, and $\|\mathbf{A}\|_1 = \sum_{i,j} |A_{ij}|$ for a matrix $\mathbf{A}$. Many efficient algorithms can be used to solve Eq. (7), e.g., the inexact augmented Lagrange multiplier method [25].
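As a minimal sketch of Eq. (7) (our own illustration; the authors solve the joint problem directly, e.g., with inexact ALM [25]), each template can be represented over the remaining ones, which enforces the zero-diagonal constraint by construction. Here represent() is the sketch given in Section II:

```python
import numpy as np

def joint_represent(T, lam=0.01):
    """Column-by-column sketch of Eq. (7). T is d x n (one template per
    column). Returns C (n x n, zero diagonal) and E (d x n)."""
    d, n = T.shape
    C, E = np.zeros((n, n)), np.zeros((d, n))
    for i in range(n):
        others = np.delete(T, i, axis=1)     # exclude t_i: enforces C_ii = 0
        c, e = represent(T[:, i], others, lam)
        C[np.arange(n) != i, i] = c          # scatter back, skipping row i
        E[:, i] = e
    return C, E
```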

Locally Structured Gaussian Process. From the approximation perspective, in Eq. (3), $\mathbf{y}$ is approximated by $\mathbf{X}\mathbf{c} + \mathbf{e}$ with approximation errors that are assumed to follow an i.i.d. Gaussian with variance $\sigma_n^2$. Thus, we assume $g(\mathbf{y})$ to be the observed representation with an additive Gaussian noise $\varepsilon$,

$$g(\mathbf{y}) = f(\mathbf{y}) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2), \qquad (8)$$

where $f(\mathbf{y})$ is the noise-free representation of $\mathbf{y}$ over $\mathbf{X}$.

On the other hand, as shown in Propositions 1 and 2, $g(\mathbf{y})$ indicates the similarity between $\mathbf{y}$ and $\mathbf{X}$. From this viewpoint, we assume that the noise-free representations of the target templates follow an i.i.d. Gaussian, because the target templates are considered to be similar to each other once the sparse noise, e.g., occlusions, is swallowed up by the representation errors $\mathbf{E}$. To this end, $f$ defines a Gaussian process

$$f \sim \mathcal{GP}\big(m(\cdot), k(\cdot, \cdot)\big), \qquad (9)$$

with the mean function $m$ and covariance function

$$k(\mathbf{t}_i, \mathbf{t}_j) = \exp\big(-d(\mathbf{t}_i, \mathbf{t}_j)/\theta\big) + \sigma_n^2 \, \delta_{ij}, \qquad (10)$$

where $\delta_{ij}$ is one if and only if $i = j$ and zero otherwise, $\sigma_n^2$ denotes the noise power, and $\theta$ is the scale parameter. From Eq. (7), we can see that the covariance adaptively exploits the locally intimate relationships among the target templates, because $c_{ij}$ can be viewed as indicating the relation between the $i$-th and the $j$-th target templates and is zero with high probability. Thus, we say that the Gaussian process defined in Eq. (9), which is also viewed as the noise-free version of the observed representation, is locally structured.

Likelihood Evaluation. With the defined Gaussian process, given a candidate $\mathbf{y}$, we can estimate its noise-free representation $f(\mathbf{y})$ by Gaussian process regression [26] over the training set, in which the target templates are used as the input vectors (covariates) and their observed representations are used as the corresponding scalar outputs (dependent variables). Meanwhile, we can also evaluate the observed representation $g(\mathbf{y})$ of this candidate using Definition 1.

For a well-represented candidate, we expect that it can be represented well jointly with the target templates. This indicates that the difference between the noise-free and the observed representations of this candidate (i.e., the magnitude of the noise $\varepsilon$ in Eq. (8)) should be small enough. Thus, we define the likelihood of a candidate belonging to the target as

$$p(\mathbf{y}) \propto \exp\Big(-\big(g(\mathbf{y}) - f(\mathbf{y})\big)^2 / \eta\Big), \qquad (11)$$

where $\mathbf{y}$ is the candidate and $\eta$ is the scale parameter.
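A sketch of this likelihood evaluation (Eqs. (8)–(11)): standard Gaussian process regression [26] gives the predictive mean of the noise-free representation. The RBF covariance below is an assumed stand-in for the locally structured covariance of Eq. (10), and the parameter values sigma_n and eta are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, theta=1.0):
    """Generic RBF covariance between the columns of A and B (a stand-in
    for the coefficient-based covariance of Eq. (10))."""
    sq = np.sum(A**2, 0)[:, None] + np.sum(B**2, 0)[None, :] - 2 * A.T @ B
    return np.exp(-sq / (2 * theta**2))

def likelihood(y, T, g_T, g_y, sigma_n=0.1, eta=0.05):
    """Sketch of Eqs. (8)-(11): GP regression over (templates, observed
    representations) predicts the noise-free representation f(y); the
    likelihood decays with the gap between f(y) and the observed g(y)."""
    K = rbf_kernel(T, T) + sigma_n**2 * np.eye(T.shape[1])  # noisy Gram matrix
    k_star = rbf_kernel(T, y[:, None])[:, 0]                # cov(templates, y)
    f_y = k_star @ np.linalg.solve(K, g_T)                  # predictive mean
    return np.exp(-(g_y - f_y)**2 / eta)                    # Eq. (11)
```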

Fig. 1 shows the target locations in the cases of pose change and occlusion, where one well-represented and two poorly-represented candidates are analyzed. We can see that for the well-represented candidate, its noise-free representation tends to be the same as its observed representation in both cases, leading to a larger likelihood value, while the two poorly-represented candidates behave quite differently. Further, from Fig. 1(a), it can be seen that in the case of pose change, the Gaussian process regression diligently responds to the different input variables (i.e., the candidates), while the observed representation appears a little "lazy" (less varied) due to the locality of the relation coefficients $\mathbf{c}$. In contrast, from Fig. 1(b), it can be seen that in the case of occlusion, the observed representation actively fits the different candidates, while the noise-free representation tends to be similar due to the capability of the sparse representation errors $\mathbf{e}$ to swallow up the occlusions. Thus, our representation is accurate and robust in both cases, leading to effective likelihood evaluation.


Fig. 1. Target locations in two cases. In each sub-figure, one well-represented and two poorly-represented candidates are marked in the left image along with the target templates, and their noise-free representations, observed representations, and likelihood values are shown in the right image: (a) in the case of pose change; (b) in the case of occlusion.

Tracking Framework. On the $t$-th frame, given all the temporally obtained targets $\mathbf{y}_{1:t-1}$, the motion state of the $t$-th target, denoted by $\mathbf{s}_t$, is estimated by maximizing

$$p(\mathbf{s}_t \mid \mathbf{y}_{1:t-1}) = \int p(\mathbf{s}_t \mid \mathbf{s}_{t-1}) \, p(\mathbf{s}_{t-1} \mid \mathbf{y}_{1:t-1}) \, d\mathbf{s}_{t-1}, \qquad (12)$$

where $p(\mathbf{s}_t \mid \mathbf{s}_{t-1})$ denotes the motion model. Then, a candidate $\mathbf{y}_t$ is available, and the distribution is updated by

$$p(\mathbf{s}_t \mid \mathbf{y}_{1:t}) \propto p(\mathbf{y}_t \mid \mathbf{s}_t) \, p(\mathbf{s}_t \mid \mathbf{y}_{1:t-1}), \qquad (13)$$

where $p(\mathbf{y}_t \mid \mathbf{s}_t)$ denotes the observation model. The target on the $t$-th frame, denoted by $\mathbf{y}_t^\star$, is found by

$$\mathbf{y}_t^\star = \arg\max_{\mathbf{y}_t \in \mathcal{Y}_t} \; p(\mathbf{s}_t \mid \mathbf{y}_{1:t}), \qquad (14)$$

where $\mathcal{Y}_t$ denotes the set of all the candidates.

The motion model is defined as a Gaussian distribution $p(\mathbf{s}_t \mid \mathbf{s}_{t-1}) = \mathcal{N}(\mathbf{s}_t; \mathbf{s}_{t-1}, \boldsymbol{\Sigma})$, where the covariance $\boldsymbol{\Sigma}$ is a diagonal matrix whose diagonal entries denote the variances of the 2D translation and scaling, respectively. The observation model $p(\mathbf{y}_t \mid \mathbf{s}_t)$ is defined to be proportional to the likelihood value $p(\mathbf{y})$ in Eq. (11). Because the observation model is also involved in candidate (particle) sampling within the particle filter framework, we set a relatively small value for the scale parameter $\eta$ in Eq. (11) to achieve a tradeoff between accuracy and speed. Finally, we depict the main steps of our tracking algorithm in Algorithm 1.

Algorithm 1 Tracking Algorithm

Input: target templates $\mathbf{T}$ and all the candidates on the current frame.
Output: the target located on the current frame.
1: Compute the observed representations of the target templates by using Eq. (7).
2: for each candidate $\mathbf{y}$ do
3: Compute the noise-free representation $f(\mathbf{y})$ by Gaussian process regression over the training set, where the target templates are used as the input vectors and their observed representations are used as the corresponding scalar outputs, according to Eqs. (8) and (9).
4: Compute the observed representation $g(\mathbf{y})$ by using Definition 1.
5: Compute the likelihood $p(\mathbf{y})$ using Eq. (11).
6: end for
7: Locate the target by using Eq. (14).
8: Update the target templates $\mathbf{T}$.
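The following sketch assembles Algorithm 1 into one frame of the particle filter loop (Eqs. (12)–(14)). The state layout, the variance values in std, and the helper crop_patch are illustrative assumptions of ours; g and likelihood refer to the earlier sketches:

```python
import numpy as np

def track_frame(frame, s_prev, T, g_T, n_particles=100, std=(4.0, 4.0, 0.01)):
    """One frame of Algorithm 1: sample candidate states from the Gaussian
    motion model, score each candidate with the GP-based likelihood, and
    return the best state and candidate (Eq. (14))."""
    # Motion model: diagonal-covariance Gaussian around the previous state
    states = s_prev + np.random.randn(n_particles, 3) * np.asarray(std)
    best_p, best = -np.inf, None
    for s in states:
        y = crop_patch(frame, s)            # assumed helper: crop + normalize
        p = likelihood(y, T, g_T, g(y, T))  # Eq. (11) via the sketches above
        if p > best_p:
            best_p, best = p, (s, y)
    return best
```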

Update Strategy. To accurately reflect the appearance changes of the target, the target templates need to be updated dynamically. In this work, we follow [18] to update the target templates. Unlike [18], where the latest target is directly used as the new template, we replace a template with the reconstruction $\mathbf{T}\mathbf{c}$ of the latest target, where $\mathbf{c}$ is the sparse coefficient vector of the latest target over the target templates $\mathbf{T}$, obtained from Eq. (3).
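A sketch of this update under the same assumptions as before; the index of the replaced template is left as a parameter, since the replacement rule of [18] is not reproduced here:

```python
def update_templates(T, y_latest, lam=0.01, idx=0):
    """Replace template idx with the reconstruction T @ c of the latest
    target, where c comes from Eq. (3) (see represent() above)."""
    c, _ = represent(y_latest, T, lam)
    T = T.copy()
    T[:, idx] = T @ c                       # reconstruction, not the raw target
    return T
```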

IV. EXPERIMENTS

Our tracker is implemented in MATLAB on a PC with an Intel Core 2.8 GHz processor. The average running speed is 3 frames per second. The color pixels on each frame are converted to gray scale, and the pixel values are normalized. 10 target templates and 100 candidates are used. The corresponding patches of the candidates and the target templates are normalized to the same size in pixels. In the motion model, the variances of the 2D translation and scaling are set to fixed values.

We compare our tracker against seven other state-of-the-art trackers: IVT [4], TLD [27], [28], Struck [29], SCM [30], [31], MTT [15], [19], CT [32], [33], and LSST [34]. The implementations of the competing trackers are publicly provided by their authors. The parameters of the competing trackers are tuned carefully to obtain their best results. Note that some competing trackers, e.g., SCM and LSST, are sensitive to the parameters of their motion models. Thus, we tune the parameters of these trackers independently for each video sequence to obtain their best results. Importantly, we also emphasize that all the parameters of our tracker are fixed for all the experiments in this paper.

Qualitative Evaluations. Fig. 2 shows the tracking results obtained by our tracker and the seven competing trackers on representative frames of the sixteen video sequences.

Illumination Variations. On the video sequences car4, david, davidNew, singer1 and skating1, the appearance of the target changes drastically due to the illumination variations of the scenarios. Our tracker obtains good results because the Gaussian process regression estimates the noise-free representations stably. Besides, the subspace representation-based tracker IVT also performs well.

Occlusions. On the video sequences caviar1, caviar2, caviar3, thusl, thusy and walker, the target is occluded by other similar or dissimilar objects. It can be seen that our tracker achieves good tracking results on these video sequences. This is attributed to: 1) the representation is robust against occlusions; and 2) the likelihood of the candidates belonging to the target is evaluated stably and accurately. The trackers based on sparsity, e.g., SCM, MTT, and LSST, also obtain comparable results.


Fig. 2. Tracking results on representative frames of the sixteen experimental video sequences. From left to right and top to bottom, the frames are from the video sequences car4, car11, caviar1, caviar2, caviar3, david, davidNew, football, girl, human1, singer1, skating1, sylv, thusl, thusy, and walker.

Out-of-Plane Rotations. On the video sequences david, davidNew, girl and sylv, the target suffers from abrupt appearance changes caused by out-of-plane rotations. From the tracking results, it can be seen that our tracker is robust to this challenging situation and achieves good performance. This is because the update strategy always updates the target templates by using their linear combinations, ensuring that the rotated target can also be represented accurately.

Background and Similar Objects Distractions. On the video sequences car11, football, thusl and thusy, the target is distracted by the background or similar objects. It can be seen that our tracker obtains good results. This benefits from the high accuracy of our representation and the effectiveness of the likelihood function defined in Eq. (11).

Quantitative Evaluations. The target locations in the sixteen experimental video sequences are manually labelled and used as the ground truth. To quantitatively evaluate the performance of our tracker, the average tracking location errors (TLE) and the success rates (SR) on the sixteen experimental video sequences are reported in Tables I and II, respectively. The SR is computed by the criterion that the tracking result on a frame is successful if $\operatorname{area}(B_T \cap B_G) / \operatorname{area}(B_T \cup B_G) > 0.5$, where $B_T$ and $B_G$ denote the bounding boxes of the tracking and ground-truth results, respectively. Overall, it can be seen that our tracker outperforms the other seven state-of-the-art trackers on the sixteen experimental video sequences.
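For reference, this success criterion is the standard bounding-box overlap; the sketch below assumes boxes given as (x, y, w, h) tuples:

```python
def overlap(bt, bg):
    """PASCAL-style overlap area(Bt & Bg) / area(Bt | Bg) for (x, y, w, h) boxes."""
    x1, y1 = max(bt[0], bg[0]), max(bt[1], bg[1])
    x2 = min(bt[0] + bt[2], bg[0] + bg[2])
    y2 = min(bt[1] + bt[3], bg[1] + bg[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = bt[2] * bt[3] + bg[2] * bg[3] - inter
    return inter / union

# A frame is counted as a success when overlap(tracked_box, truth_box) > 0.5.
```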

Investigation on Parameter. We investigate the key parameter in Eq. (6). Because a large value may increase the computational cost, we restrict the investigated range. The precision and success rate plots with respect to different values of this parameter on the sixteen experimental video sequences are shown in Fig. 3. The precision is defined as the percentage of frames where the tracking location errors are less than a threshold.

TABLE I
AVERAGE TRACKING LOCATION ERRORS (IN PIXELS) OF THE EIGHT TRACKERS OVER THE SIXTEEN EXPERIMENTAL VIDEO SEQUENCES. THE BEST RESULTS ARE SHOWN IN BOLDFACE FONT.

TABLE II
SUCCESS RATES OF THE EIGHT TRACKERS ON THE SIXTEEN VIDEO SEQUENCES. THE BEST RESULTS ARE SHOWN IN BOLDFACE FONT.

Fig. 3. Precision and success rate plots on the sixteen experimental video sequences with respect to different values of the parameter in Eq. (6).

Instead of using the fixed threshold 0.5, we plot the success rates with respect to different thresholds. From the results, it can be seen that the value adopted in our experiments is the best choice within the investigated range.
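The two curves of Fig. 3 can be produced as follows, assuming per-frame location errors and overlap scores have already been collected (array names are ours):

```python
import numpy as np

def precision_curve(errors, thresholds):
    """Fraction of frames whose location error is below each threshold."""
    errors = np.asarray(errors)
    return [(errors < t).mean() for t in thresholds]

def success_curve(overlaps, thresholds):
    """Fraction of frames whose overlap exceeds each threshold (cf. Fig. 3)."""
    overlaps = np.asarray(overlaps)
    return [(overlaps > t).mean() for t in thresholds]
```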

    V. CONCLUSION

We have proposed a new target representation method, where the temporally obtained targets are jointly represented as a time series function by exploiting their spatially local structure. With this method, we have proposed a new tracking algorithm via locally structured Gaussian process regression. A large number of experiments have been conducted, and both the qualitative and quantitative evaluations have demonstrated that our tracker outperforms several other state-of-the-art trackers.


REFERENCES

[1] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Comput. Surv., vol. 38, no. 4, pp. 13–57, Dec. 2006.
[2] Y. Wu, J. Lim, and M.-H. Yang, "Online object tracking: A benchmark," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2411–2418.
[3] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, "Visual tracking: An experimental survey," IEEE Trans. Patt. Anal. Mach. Intell., vol. 36, no. 7, pp. 1442–1468, Jul. 2014.
[4] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," Int. J. Comput. Vis., vol. 77, no. 1–3, pp. 125–141, Aug. 2007.
[5] J. Kwon and K. Lee, "Visual tracking decomposition," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2010, pp. 1269–1276.
[6] D. Wang and H. Lu, "Object tracking via 2DPCA and L1-regularization," IEEE Signal Process. Lett., vol. 19, no. 11, pp. 711–714, 2012.
[7] X. Mei and H. Ling, "Robust visual tracking using L1 minimization," in IEEE Int. Conf. Computer Vision (ICCV), 2009, pp. 1436–1443.
[8] B. Liu, L. Yang, J. Huang, and P. Meer, "Robust and fast collaborative tracking with two stage sparse optimization," in Eur. Conf. Computer Vision (ECCV), 2010, pp. 1–14.
[9] X. Mei, H. Ling, and Y. Wu, "Minimum error bounded efficient L1 tracker with occlusion detection," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1257–1264.
[10] X. Mei, H. Ling, Y. Wu, E. P. Blasch, and L. Bai, "Efficient minimum error bounded particle resampling L1 tracker with occlusion detection," IEEE Trans. Image Process., vol. 22, no. 7, pp. 2661–2675, 2013.
[11] B. Liu, J. Huang, L. Yang, and C. Kulikowski, "Robust tracking using local sparse appearance model and K-selection," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), Jun. 2011, pp. 1313–1320.
[12] X. Jia, H. Lu, and M.-H. Yang, "Visual tracking via adaptive structural local sparse appearance model," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1822–1829.
[13] D. Wang, H. Lu, and C. Bo, "Online visual tracking via two view sparse representation," IEEE Signal Process. Lett., vol. 21, no. 9, pp. 1031–1034, 2014.
[14] C. Bao, Y. Wu, H. Ling, and H. Ji, "Real time robust L1 tracker using accelerated proximal gradient approach," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), Jun. 2012, pp. 1830–1837.
[15] T. Zhang, B. Ghanem, and S. Liu, "Robust visual tracking via multi-task sparse learning," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2042–2049.
[16] H. Li, C. Shen, and Q. Shi, "Real-time visual tracking using compressive sensing," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1305–1312.
[17] J. Wright, A. Yang, and A. Ganesh, "Robust face recognition via sparse representation," IEEE Trans. Patt. Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, 2009.
[18] X. Mei and H. Ling, "Robust visual tracking and vehicle classification via sparse representation," IEEE Trans. Patt. Anal. Mach. Intell., vol. 33, no. 11, pp. 2259–2272, Nov. 2011.
[19] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, "Robust visual tracking via structured multi-task sparse learning," Int. J. Comput. Vis., vol. 101, pp. 367–383, 2013.
[20] Y. Pati, "Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition," in Asilomar Conf. Signals, Systems and Computers, 1993, pp. 40–44.
[21] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. Roy. Statist. Soc. B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
[22] V. Buldygin and Y. Kozachenko, Metric Characterization of Random Variables and Random Processes. Providence, RI, USA: AMS, 2000.
[23] M. Isard, "CONDENSATION - Conditional density propagation for visual tracking," Int. J. Comput. Vis., vol. 29, no. 1, pp. 5–28, 1998.
[24] M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, "A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking," IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, 2002.
[25] Z. Lin, M. Chen, and Y. Ma, "The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices," UIUC Tech. Rep., 2010.
[26] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[27] Z. Kalal, J. Matas, and K. Mikolajczyk, "P-N learning: Bootstrapping binary classifiers by structural constraints," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), Jun. 2010, pp. 49–56.
[28] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-learning-detection," IEEE Trans. Patt. Anal. Mach. Intell., vol. 34, no. 7, pp. 1409–1422, 2012.
[29] S. Hare, A. Saffari, and P. Torr, "Struck: Structured output tracking with kernels," in IEEE Int. Conf. Computer Vision (ICCV), 2011, pp. 263–270.
[30] W. Zhong, H. Lu, and M.-H. Yang, "Robust object tracking via sparsity-based collaborative model," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1838–1845.
[31] W. Zhong, H. Lu, and M.-H. Yang, "Robust object tracking via sparse collaborative appearance model," IEEE Trans. Image Process., vol. 23, no. 5, pp. 2356–2368, May 2014.
[32] K. Zhang, L. Zhang, and M.-H. Yang, "Real-time compressive tracking," in Eur. Conf. Computer Vision (ECCV), 2012, pp. 866–879.
[33] K. Zhang, L. Zhang, and M.-H. Yang, "Fast compressive tracking," IEEE Trans. Patt. Anal. Mach. Intell., 2014, to appear.
[34] D. Wang, H. Lu, and M.-H. Yang, "Least soft-threshold squares tracking," in IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2371–2378.

