

On pedestrian detection and tracking in infrared videos

Jiang-tao Wang a,*, De-bao Chen a, Hai-yan Chen b, Jing-yu Yang b

a Huaibei Normal University, Huaibei 235000, China
b Nanjing University of Science and Technology, Nanjing 210094, China

* Corresponding author. Address: Department of Physical and Electronic Information, Huaibei Normal University, Huaibei 235000, China. Tel.: +86 561 3805264; fax: +86 561 3803256.
E-mail address: [email protected] (J.-t. Wang).

ARTICLE INFO

Article history:
Received 14 December 2009
Available online 30 December 2011
Communicated by A. Shokoufandeh

Keywords:
Infrared pedestrian detection
Infrared pedestrian tracking
Multi-cue fusion
Particle filter

ABSTRACT

This article presents an approach for pedestrian detection and tracking in infrared imagery. A GMM background model is first deployed to separate foreground candidates from the background; a shape descriptor is then introduced to construct the feature vector for pedestrian candidates, and an SVM classifier is trained on datasets generated from infrared images or manually. After detecting pedestrians with the SVM classifier, a multi-cue fusion algorithm is provided to facilitate pedestrian tracking using both edge and intensity features under the particle filter framework. Experimental results on various infrared video databases are reported to demonstrate the accuracy and robustness of our algorithm.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Pedestrian detection and tracking in video sequences is one of the main issues in computer vision. It is useful for many vision-based applications, including visual surveillance, human–computer interfaces, traffic monitoring systems, video compression and many more. Unfortunately, pedestrian detection is a challenging task due to the non-rigidity of the human body, and in the last decade there have been many different approaches to this problem, such as the use of monocular vision (Enzweiler and Gavrila, 2009) and stereo vision (Bajracharya et al., 2009). Dai et al. (2007) present a person detection method based on both shape and appearance cues: they first introduce a layered representation technique to separate the image into background and foreground layers, and detection is then solved as a multi-scale template matching problem based on the shape and appearance cues; this approach leads to high computational cost. Bertozzi et al. (2007) present a stereo system for the detection of pedestrians using far-infrared cameras. In that system, three detection techniques (warm area detection, edge-based detection, and disparity computation) are exploited according to different environmental conditions; based on this, a final validation process using head morphological and thermal characteristics confines the detection result. The limitation of this approach is that a stereo vision based system has a smaller observation space than a monocular one. To solve the person detection problem, Fang et al. (2004) first estimated the person candidate location through a ''Projection-based'' horizontal segmentation and a ''Brightness/Bodyline-based'' vertical segmentation; shape-independent features are then applied to classify those candidates. This method is very flexible; however, the detection performance relies heavily on the infrared imaging quality. Apart from vision-based systems, some approaches use laser scanners to retrieve a 3D map of the terrain and detect pedestrians, or use ultrasonic sensors to measure the reflection of pedestrians.

To achieve robustness and reduce uncertainty in object tracking, tremendous research effort has been devoted over the past few years to enhancing visual tracking performance by designing various tracking algorithms and making use of different tracking features (Collins et al., 2005; Han and Davis, 2005; Yang et al., 2005; Comaniciu et al., 2000, 2003; Jepson et al., 2003). Great progress has been made, as reported in the literature.

However, pedestrian tracking still suffers from a lack of robustness due to the dynamic changes of the human body and its environment. Though it is noted that the fusion of multiple cues leads to increased reliability of the tracking system, most current tracking algorithms are based on a single cue determined a priori and are, therefore, often limited to a particular environment. In this case, the fixed feature is chosen at random, or, sometimes, preliminary experiments are run to determine which feature to use. To improve robustness in tracking, multi-cue based methods have attracted the attention of researchers.

There are various features that can be used for representing objects, such as color, depth, motion and texture. Since most computer vision problems are ill-posed, more features increase the robustness of solutions. Up to now, a number of studies have been published on the fusion of multiple cues. The success




of a multi-cue based tracking algorithm relies on two key issues: (1) what features are used, and (2) how the cues are integrated. In this paper, we focus on the second issue. A straightforward approach is to use all cues in parallel and treat them as equivalent channels; this approach has been reported in (Li and Francois, 2004) and (Wang and David, 2006). In their works, different cues are combined directly in a likelihood manner; however, a limitation of this method is that it does not take account of each cue's discriminative ability. Democratic Integration is an architecture that allows the tracking of objects through the fusion of multiple adaptive cues in a self-organized fashion. This method was proposed by Triesch and von der Malsburg (2001) and explored more deeply in (Spengler and Schiele, 2003) and (Shen et al., 2003). In this framework each cue creates a two-dimensional cue report, or saliency map, and cue fusion is carried out by computing a fused saliency map as a weighted sum of all the cue reports. Democratic Integration is extended by Spengler and Schiele (2003) to use a particle filter for tracking multiple targets. Pérez et al. (2004) present a particle filter based visual tracker that fuses three cues: color, motion and sound. In their work, color serves as the main visual cue and, according to the scenario under consideration, is fused with either sound localization cues or motion activity cues. A partitioned sampling technique is applied to combine the different cues: particle resampling is implemented not on the whole feature space but in each single feature space separately. This technique increases the efficiency of the particle filter; however, in their case only two cues can be used simultaneously, which restricts the flexible selection of cues and the extension of the method. Differing from the above-mentioned fusion methods, Loy et al. (2002) fuse different cues based on relative entropy, i.e. the Kullback–Leibler distance between a cue's probability distribution and the target's probability distribution. Wu and Huang (2004) formulate the problem of integrating multiple cues for robust tracking as a probabilistic inference problem on a factorized graphical model. To analyze this complex graphical model, a variational method is taken to approximate the Bayesian inference. Interestingly, the analysis reveals a co-inference phenomenon of multiple modalities, which illustrates the interactions among different cues, i.e., one cue can be inferred iteratively from the other cues. An efficient sequential Monte Carlo tracking algorithm is used to integrate multiple visual cues, in which the co-inference of different modalities is approximated.

In this paper, we concentrate on developing a pedestrian detection and tracking system based on monocular infrared imagery grabbed by a stationary infrared camera. Autonomous pedestrian detection in infrared differs from pedestrian detection in the visible spectrum, as the two types of images have different characteristics. In our work, an intensity based shape descriptor is introduced to construct the pedestrian feature vectors. Based on this, an SVM classifier is trained to distinguish pedestrians from other objects. After detecting a pedestrian, a multi-cue based particle filter is employed to track the target. To fuse multiple cues adaptively, the relative discriminative coefficients (RDC) of the different cues are defined and computed to measure their tracking ability. Each cue is weighted based on this measure, and then all the cues are fused in a weighted-sum manner. Fig. 1 gives a flow chart of the presented infrared pedestrian detection and tracking system.

The remainder of the paper is organized as follows. Section 2 presents a pedestrian feature descriptor for infrared pedestrian detection. Section 3 gives detection results on several datasets. Section 4 introduces a multi-cue adaptive fusion based particle filter tracking framework. Experimental results for infrared pedestrian tracking under various scenarios are shown in Section 5. Finally, we conclude the paper in Section 6 with a summary.

2. Pedestrian feature descriptor

There exist many feature-image creation methods for obtaining human features, such as IIR filtering, optical flow, etc. In this work, the binary silhouette image of the human is used as the feature image. In order to segment the silhouette accurately, a Gaussian Mixture Model (GMM) is adopted to extract and update the background. This model was first presented by Stauffer and Grimson (1999); it is an adaptive online background mixture model that can robustly deal with lighting changes, repetitive motions, clutter, the addition or removal of objects from the scene, and slowly moving objects. Their motivation was that a unimodal background model could not handle image acquisition noise, lighting changes and multiple layers of motion for a particular pixel at the same time. Thus, they used a mixture of Gaussian distributions to represent each pixel in the model. Due to these promising features, we implement and integrate this model in our pedestrian detection and tracking system. A detailed description of this method can be found in (Stauffer and Grimson, 1999).

In this model, the value of an individual pixel (e.g. a scalar for gray values or a vector for color images) over time is considered a ''pixel process'', and the recent history of each pixel is modeled by a mixture of K Gaussian distributions with weights w_k (k = 1, ..., K). The value of K is chosen depending on the available memory and computational power. In order to determine the type (foreground or background) of a new pixel, let σ be the standard deviation of the Gaussian distribution with weight w; the K Gaussian distributions are sorted by the value of w/σ. Then the first B distributions are chosen as the background model, where

Fig. 1. Flow chart of the proposed infrared pedestrian detection and tracking algorithm (detection: initial frame → GMM model → pedestrian candidates → feature construction → SVM classifier → pedestrian targets; tracking: initialize particles → resample → multi-cue fusion of intensity and edge cues → state estimation).


B = \arg\min_b \Big( \sum_{k=1}^{b} w_k > T \Big) \qquad (1)

and T is the minimum portion of the pixel data that should be accounted for by the background; b is the minimum number of Gaussian distributions that satisfies Eq. (1). If a small value is chosen for T, the background is generally unimodal. In our method we find that the system performs well when K is 3 and T is 0.72.
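For readers who want to experiment with this stage, a minimal sketch of GMM background subtraction is given below. It uses OpenCV's MOG2 implementation of the adaptive mixture model in the spirit of Stauffer and Grimson (1999); the input file name and the history/varThreshold values are illustrative assumptions and do not map one-to-one onto the paper's K = 3, T = 0.72 setting.

```python
# Minimal sketch: GMM background subtraction on an infrared sequence,
# in the spirit of Stauffer and Grimson (1999). The video path and
# parameter values below are illustrative assumptions.
import cv2

cap = cv2.VideoCapture("infrared_sequence.avi")  # hypothetical input file
# MOG2 is OpenCV's adaptive Gaussian-mixture background model.
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=200, varThreshold=16, detectShadows=False)  # no shadows in IR

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    fg_mask = subtractor.apply(gray)          # 0 = background, 255 = foreground
    # Morphological opening to reduce noise, as in Section 2.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)
cap.release()
```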

Images in the infrared (IR) domain convey a type of information that is very different from images in the visible spectrum. While in the visible spectrum the image of an object depends on the amount of incident light on its surface and on how well the surface reflects it, in the IR domain the image of an object is related to its temperature and the amount of heat it emits.

The most salient characteristic of infrared imaging is that, unlike visible imaging, it does not depend on lighting conditions but rather varies with temperature changes. Even in daytime outdoor scenes, this independence from lighting conditions is advantageous for image processing algorithms, as lighting conditions generally change faster than temperature, especially in scenes with clouds occluding sunlight. So with background modeling techniques, background adaptation is generally less crucial in infrared than in the visible. Foreground region detection using background subtraction is also easier in infrared, as there are no shadows in infrared scenes, whereas shadow removal is one of the most important issues for background subtraction in the visible spectrum.

On the other hand, due to camera technology limitations, infrared images generally have lower spatial resolution (fewer pixels) and less sensitivity than visible ones. This may limit the efficiency of the classification step based on pedestrian silhouette/shape, especially for pedestrians far from the camera. Also, due to these limitations, infrared images cannot provide as much information about objects as visible ones. This is further exacerbated by the fact that there is no spectral information in infrared images, as there is only one sensor compared to the three-channel RGB sensor arrangement in visible cameras, and this may lead to some problems in the tracking algorithms.

Although human bodies may take various shapes, compared to non-pedestrian objects the human body has its own common features. In this work we introduce a human body model based on the shape feature. The proposed model can be constructed in the following way:

After the foreground has been extracted from the infrared image, a mathematical morphology operator is first employed to reduce the noise; then each foreground connected region is extracted and fitted with a rectangular window. In the case of a single pedestrian this rectangular region serves as the human body candidate window. Each candidate window is divided into m × n sub-rectangles of equal size, and the sum of intensity S within each sub-window is calculated. The feature value of each sub-region is formed as S/A, where A is the area of the sub-region. In this way, each candidate window can be represented by an m × n matrix V:

V = \begin{bmatrix} v_{11} & \cdots & v_{1n} \\ \vdots & \ddots & \vdots \\ v_{m1} & \cdots & v_{mn} \end{bmatrix}, \qquad v_{m,n} = \frac{S_{m,n}}{A_{m,n}} \qquad (2)

In this matrix, each element is the value of the corresponding sub-region. As shown in Fig. 2, the human candidate window is divided into 5 × 5 sub-windows: Fig. 2(a) is the original image and Fig. 2(b) gives the feature model. In Fig. 2(b) each pane is associated with one feature element; the value of each feature element lies in the range [0, 1], and a pane appears brighter when its corresponding feature element has a larger value. Fig. 2(c) shows the topological appearance of the human feature.

Fig. 2. The representing model of human shape.

To distinguish pedestrians from other objects (such as cars) more clearly, a weight is introduced based on the fact that the scale of the human body is always limited to a predefined range. Denoting the height and width of the human rectangular window by H and W, the weight can be defined as k = W/H; the feature matrix after weighting then becomes:

V^{*} = k \begin{bmatrix} v_{11} & \cdots & v_{1n} \\ \vdots & \ddots & \vdots \\ v_{m1} & \cdots & v_{mn} \end{bmatrix}, \qquad k = W/H \qquad (3)

In practice, interaction between different people is very common, and their bodies may merge into one connected region in the image. In order to make the feature model usable in this case, a separation process is needed to divide a group of people into single ones. In spite of the occlusion between different people, the human head still presents an obvious character: after a vertical image projection, the head's position usually corresponds to a peak of the projection curve. Based on this, as shown in Fig. 3, we can separate interacting people in the following way:

Fig. 3. An illustration of separating multi-bodies using the projection curve.

(1) Extract the foreground from the infrared image.
(2) Make an intensity projection along the vertical direction.
(3) Detect the peaks of the projection curve; if there is only one peak, go to step (1), else go to step (4).
(4) Separate the foreground into parts according to the number of peaks in the curve.

However, in experiments we find that the lower half of the pedestrian image has a sparse distribution, so we only use the upper half of the candidate image to compute the projection curve.
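The projection-based separation can be sketched as below; scipy.signal.find_peaks stands in for the paper's unspecified peak detector, and the distance/height parameters are assumptions.

```python
# Sketch of the projection-based separation of steps (1)-(4): project the
# upper half of the binary foreground window onto the horizontal axis and
# split midway between adjacent head peaks.
import numpy as np
from scipy.signal import find_peaks

def split_merged_bodies(fg_window: np.ndarray) -> list:
    """fg_window: binary (0/1) foreground mask of one connected region."""
    upper = fg_window[: fg_window.shape[0] // 2]   # lower half is too sparse
    projection = upper.sum(axis=0)                 # column-wise intensity sum
    peaks, _ = find_peaks(projection,
                          distance=10,             # assumed minimum head gap
                          height=0.3 * projection.max())
    if len(peaks) <= 1:
        return [fg_window]                         # single pedestrian
    # Split midway between adjacent head peaks.
    cuts = [(peaks[i] + peaks[i + 1]) // 2 for i in range(len(peaks) - 1)]
    bounds = [0] + cuts + [fg_window.shape[1]]
    return [fg_window[:, bounds[i]:bounds[i + 1]]
            for i in range(len(bounds) - 1)]
```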

3. Pedestrian detection results

Once the feature model has been established by the above algorithm, a classifier is deployed to classify the candidates into either a human or a non-human class. SVM is a technique for training classifiers that is well founded in statistical learning theory. We have implemented the SVM classification method with an RBF kernel on IR images (Davis, 2007, http://www.cse.ohio-state.edu/otcbvs-bench/), using the foreground candidate images as training and testing samples. A number of such training samples are shown in Fig. 4; to improve the training ability we add a series of manually generated sample images (Fig. 4(b)).

Fig. 4. Human samples and non-human samples in the training dataset.
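A sketch of this classification stage using scikit-learn is shown below; the feature files, labels and hyper-parameters are illustrative assumptions, not the paper's actual training setup.

```python
# Sketch of the classifier stage: an RBF-kernel SVM trained on the 4 x 4
# (16-dimensional) feature vectors of Section 2.
import numpy as np
from sklearn.svm import SVC

# X: (num_samples, 16) feature matrix built with shape_feature() above;
# y: 1 for pedestrian, 0 for non-pedestrian. The .npy files are
# hypothetical placeholders for precomputed features and labels.
X = np.load("features.npy")
y = np.load("labels.npy")

clf = SVC(kernel="rbf", C=1.0, gamma="scale")    # assumed hyper-parameters
clf.fit(X, y)
is_pedestrian = clf.predict(X[:5])               # classify a few candidates
```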

Before classification, we first determine the dimension of the feature model by observing the classification results on a sub-dataset while varying the feature model's dimension. The classification results with feature models of various dimensions are shown in Table 1; we find that with a feature model of dimension 4 × 4, the classification results for both pedestrian and non-pedestrian samples achieve a correct rate above 97%. So in our experiments the dimension of the feature model is set to 4 × 4.

Table 1. Performance of the classifier with input vectors of various dimensions.

Dimension                                         4       9       16      25      36      49
Pedestrian samples classified correctly (%)       38.5    74.5    97      68.5    65      39
Non-pedestrian samples classified correctly (%)   88.43   85.95   97.52   95.87   100     97.52

In the tests, four image galleries are selected. The image gallery for test 1 is taken from the Terravic Motion IR Database; sample images can be seen in Fig. 5(a) and (d). The image gallery for test 2 is obtained from the OSU Color-Thermal Database; Fig. 5(b) and (e) show two sample images from it. The image gallery for test 3 is the OSU Thermal Pedestrian Database; Fig. 5(c) and (f) give some samples from this gallery. Detailed descriptions of these three galleries can be found online (http://www.cse.ohio-state.edu/otcbvs-bench/). The image gallery for test 4 is also an infrared sequence (Conaire et al., 2008); Fig. 5(g), (h) and (i) give some sample images from this sequence. All four sequences were captured with a fixed camera.

Table 2 gives the detection rate for the four test image galleries; in general a high rate is achieved. Fig. 5 shows some detection results on the test data. Fig. 5(a)–(g) shows correct detection results for the four test sets; in Fig. 5(h) a car is rejected (area inside the red window), and in Fig. 5(i) two people are missed because they are conglutinated vertically. We should point out that detection failures appear when the pedestrian is heavily occluded or sitting; the solution of this problem is a direction for future research. Compared with the detection method proposed by Dai et al. (2007), both approaches achieve a high detection rate, but our method has lower computational complexity.

Table 2. Experimental results of the human detection test.

        Image size   Image number   Number of pedestrians   Detection rate (%)   Mean detection time per image (s)
Test 1  240 × 320    230            460                     98.5                 0.054
Test 2  240 × 320    200            200                     100                  0.0371
Test 3  240 × 360    23             101                     98.8                 0.120
Test 4  240 × 360    1698           2650                    98.2                 0.0285

Fig. 5. Human detection results on some test images.

4. Pedestrian tracking based on multi-cue fusion

In this section, we focus on multi-cue based infrared pedestrian tracking in a particle filter framework. Here we should clarify that a fixed camera is used in this paper, which is the hypothesis necessary for this technique.

Pedestrian tracking in IR imagery is often more difficult than in visible imagery. Unlike visible imagery, which contains color and texture cues, shape is arguably the only cue that can be exploited for tracking in IR imagery. When the camera distance is large, the shape discrepancy between two different people of similar weight and height is small. The thermal signature of a person also varies constantly due to the walking motion. Moreover, when two pedestrians walk closely or pass by each other, the overlapped shape of the pedestrians experiences severe deformation, which makes tracking even more difficult.

To overcome the above difficulties, we propose a multi-cue fusion strategy that exploits both intensity and shape information adaptively at the same time. Our ultimate goal is to develop a multi-cue based particle filter technique which can combine various cues online and adaptively, so as to track objects robustly even in complex dynamic environments. In our work, the relative discriminative coefficients of the different cues are computed to measure their tracking ability. Each cue is weighted based on this measure, and then all the cues are fused in a weighted-sum manner.

4.1. Particle filter review

As mentioned above, our method is based on the tracking framework of the particle filter (PF) (Gordon et al., 1993; Nummiaro et al., 2003; Isard and Blake, 1998). For completeness, we summarize the main idea behind the particle filter in this section.

Visual tracking is often posed as a Bayesian sequential estimation problem. In this vein, the Kalman filter and the extended Kalman filter have been highly popular in designing visual trackers. However, in circumstances where heavy background clutter leads to multi-modal distributions, or where the system is highly non-linear and the densities are highly non-Gaussian, these techniques often fail because of their assumption of Gaussian models or a linear dynamic system. The particle filter relaxes the Gaussian and linear modeling assumptions of the Kalman filter by employing samples to represent probability densities and invoking Monte Carlo techniques to simulate density propagation.

Particle filters, also known as Sequential Monte Carlo (SMC) methods, bootstrap filters, Condensation trackers and so on, are sophisticated model estimation techniques based on simulation within a Bayesian framework. Denote by X_t = {x_0, x_1, ..., x_t} and Y_t = {y_0, y_1, ..., y_t} the hidden states and the observations up to time t, respectively. According to Bayesian estimation theory, the optimal estimate of x_t is given by the posterior mean E[x_t | Y_t]. Given the posterior pdf p(x_{t-1} | Y_{t-1}) at time t − 1, the current posterior pdf can be obtained by the following two steps:

Prediction step:

p(x_t \mid Y_{t-1}) = \int_{x_{t-1}} p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid Y_{t-1}) \, dx_{t-1} \qquad (4)

Update step:

p(x_t \mid Y_t) = \frac{p(y_t \mid x_t) \, p(x_t \mid Y_{t-1})}{p(y_t \mid Y_{t-1})} \qquad (5)

In the prediction step, the prior p(x_{t-1} | Y_{t-1}) is propagated to p(x_t | Y_{t-1}) through a system dynamical model p(x_t | x_{t-1}). The predicted state is then corrected by an observation likelihood function L(y_t | x_t) in the update step. In order to avoid the integral in Eq. (4), the key idea of particle filtering is to approximate the posterior pdf by a weighted sample set S = {(s^{(n)}, b^{(n)}) | n = 1, ..., N}. Each sample s represents one hypothetical state of the object, with a

corresponding discrete sampling probability b, where \sum_{n=1}^{N} b^{(n)} = 1. The mean state of an object is estimated at each time step by

E(S) = \sum_{n=1}^{N} b^{(n)} s^{(n)} \qquad (6)

The weight of each sample is proportional to the observation likelihood function; the observation likelihood functions are given in Section 4.3.

4.2. Target dynamics model

To track an object sequentially in video, we initially choose a region which defines the object. The shape of this region is fixed a priori through the definition of an ellipse or a rectangular box. In our case we choose a rectangular box characterized by the state vector

x = \{x, y, \dot{x}, \dot{y}, s_x, s_y, \dot{s}_x, \dot{s}_y\} \qquad (7)

where x and y denote the centroid of the rectangle, \dot{x} and \dot{y} are the velocities of the centroid, s_x and s_y are the lengths of the sides, and \dot{s}_x and \dot{s}_y are the velocities of s_x and s_y. A first-order auto-regressive (AR) dynamics model is adopted on these parameters. This has the form

x_t = A x_{t-1} + B v_t \qquad (8)

where v_t is multivariate normal noise, the matrix A defines the deterministic component and B the stochastic component; here both A and B are taken as identity matrices. It is straightforward to extend this model to second order if more complex dynamical motions must be captured. For the time being we use an ad-hoc model composed of three independent constant-velocity dynamics on x_t, y_t and s_t with respective standard deviations of 1 pixel/frame, 1 pixel/frame, and 0.1 frame^{-1}.
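A sketch of this prediction step is given below. The text quotes A and B as identity matrices while also describing constant-velocity dynamics, so the sketch follows the constant-velocity reading; the velocity noise scales are assumptions, while the position and scale scales follow the values quoted above.

```python
# Sketch of the prediction step of Eq. (8): each particle's position,
# velocity and scale are propagated by a first-order AR model plus
# Gaussian noise. The state layout matches Eq. (7).
import numpy as np

def propagate(particles: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """particles: (N, 8) array of [x, y, x', y', sx, sy, sx', sy']."""
    A = np.eye(8)
    # Constant-velocity coupling: position += velocity, size += size rate.
    A[0, 2] = A[1, 3] = A[4, 6] = A[5, 7] = 1.0
    # Position/scale stds follow the text; velocity stds are assumptions.
    noise_std = np.array([1.0, 1.0, 0.5, 0.5, 0.1, 0.1, 0.05, 0.05])
    noise = rng.normal(scale=noise_std, size=particles.shape)
    return particles @ A.T + noise   # x_t = A x_{t-1} + B v_t, with B = I
```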

4.3. Feature selection and observation model

Choosing appropriate features plays an important role for robust tracking in the particle filter framework. Feature selection depends mainly on the availability of object features. For simplicity, in this work we select two popular features, intensity and edge information, as the representing features. Our choice is based on their feasibility for robust visual tracking and their adaptability to our probabilistic framework. All of the feature models are based on histograms, as histograms allow some changes in the object appearance without changing the histogram.

4.3.1. Intensity cue

The intensity cue is widely used in object tracking as it achieves robustness against non-rigidity, rotation and partial occlusion. In an infrared imaging system, the radiated heat from a human body is generally higher than that of the static background, such as a house or a road. Thus the human body often has higher gray values in the infrared image than the background, which can be used for distinguishing the target from the background.

We take a histogram-based non-parametric intensity model originally introduced by Comaniciu et al. (2003) for the mean-shift based object tracking algorithm. The intensity distribution inside a rectangular region R centered at position y is denoted p_I = {p_I(u)}, u = 1, ..., m, where m is the number of bins. In our experiments, the histograms are typically calculated in intensity space using 16 bins. To increase the reliability of the intensity distribution, smaller weights are assigned to the pixels closer to the border of the target region by employing a kernel weighting function. More specifically, p_I(u) can be obtained as

p_I(u) = C \sum_{i=1}^{n} k\left( \left\| \frac{y - x_i}{h} \right\|^2 \right) \delta[b(x_i) - u] \qquad (9)

where δ is the Kronecker delta function, k is a kernel function (for example the Epanechnikov kernel), h is the bandwidth, and {x_i}_{i=1,...,n} are the locations of the pixels in the region. The normalization factor C = 1 / \sum_{i=1}^{n} k(\| (y - x_i)/h \|^2) ensures that the bins sum to one. Given a reference histogram q_I defining the target model and the histogram p_I(y) extracted from a candidate state y, the similarity between them is evaluated with the Bhattacharyya coefficient:

d_b[p_I(y), q_I] = \sqrt{1 - \sum_{u=1}^{m} \sqrt{p_I^{u}(y)\, q_I^{u}}} \qquad (10)

Upon this, the corresponding weight can be defined as follows:

b_c \propto \mathrm{likelihood}_c = \frac{1}{\sqrt{2\pi}\,\sigma_c} \exp\left( -\frac{d_b^2}{2\sigma_c^2} \right) \qquad (11)

where σ_c is the standard deviation specifying the Gaussian noise in the measurements.
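The intensity cue of Eqs. (9)–(11) can be sketched as follows, with an Epanechnikov profile as the kernel k, the bandwidth taken as half the patch size, and an assumed value for σ_c.

```python
# Sketch of the intensity cue of Eqs. (9)-(11): an Epanechnikov-weighted
# 16-bin intensity histogram, the Bhattacharyya distance to a reference
# histogram, and the Gaussian likelihood.
import numpy as np

def kernel_histogram(patch: np.ndarray, bins: int = 16) -> np.ndarray:
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Squared normalized distance to the patch center (bandwidth = half size).
    r2 = ((ys - h / 2) / (h / 2)) ** 2 + ((xs - w / 2) / (w / 2)) ** 2
    k = np.clip(1.0 - r2, 0.0, None)            # Epanechnikov profile
    idx = np.minimum((patch.astype(float) / 256.0 * bins).astype(int), bins - 1)
    hist = np.bincount(idx.ravel(), weights=k.ravel(), minlength=bins)
    return hist / hist.sum()                    # C normalizes bins to one

def bhattacharyya_distance(p: np.ndarray, q: np.ndarray) -> float:
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q))))   # Eq. (10)

def intensity_likelihood(p, q, sigma_c: float = 0.2) -> float:
    d = bhattacharyya_distance(p, q)            # sigma_c is an assumed value
    return np.exp(-d ** 2 / (2 * sigma_c ** 2)) / (np.sqrt(2 * np.pi) * sigma_c)
```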

4.3.2. Edge cue

Edge features have proven useful for modeling the shape of the object to be tracked. Here we describe the edges using a histogram of the estimated gradient orientations. Let the target be represented by a rectangular region R; the locations of the pixels in the region are {x_i}_{i=1,...,n}, and h is the bandwidth, which defines the radius of the kernel function. We construct the gradient orientation histogram in the same way as Eq. (9):

p_E(u) = C \sum_{i=1}^{n} k\left( \left\| \frac{y - x_i}{h} \right\|^2 \right) \delta[\theta(x_i) - u], \quad u = 1, \ldots, m \qquad (12)

where u is the histogram bin index, θ(x_i) is the gradient orientation of pixel x_i, δ is the Kronecker delta function, k is the kernel function emphasizing the spatial arrangement of the features in the image, and C is the normalization constant, obtained in the same way as for the intensity cue.

In order to suppress noise, the gradient orientations are filtered to include only gradients with magnitude m above a predefined threshold T_h. In our case, T_h is defined as follows:

T_h = w \cdot \max(im) + (1 - w) \cdot \mathrm{mean}(im) \qquad (13)

So we have:

\theta(i,j) = \begin{cases} \theta(i,j), & m(i,j) > T_h \\ 0, & \text{otherwise} \end{cases} \qquad (14)

Given two histograms q_E and p_E, which describe the target model in the reference frame and the current frame in this feature space respectively, the distance between them can be computed via the Bhattacharyya coefficient as:

\rho(p_E, q_E) = \sum_{u=1}^{m} \sqrt{p_E^{u}\, q_E^{u}} \qquad (15)

d_b(p_E, q_E) = \sqrt{1 - \rho(p_E, q_E)} \qquad (16)

Based on the Bhattacharyya distance, the weight for particles in thisfeature space can be defined as:

b_e \propto \mathrm{likelihood}_e = \frac{1}{\sqrt{2\pi}\,\sigma_e} \exp\left( -\frac{d_b^2}{2\sigma_e^2} \right) \qquad (17)

where σ_e is the standard deviation specifying the Gaussian noise in the measurements.

After selecting the tracking features and constructing the target models, we can use the particle filter tracking method described in (Nummiaro et al., 2003) to estimate the target state in the single-cue case.
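A sketch of the edge cue of Eqs. (12)–(14) follows, using Sobel gradients; the mixing weight w in Eq. (13) is an assumed value, and the kernel weighting mirrors the intensity-cue sketch above.

```python
# Sketch of the edge cue of Eqs. (12)-(14): Sobel gradients, magnitude
# thresholding with Th = w*max + (1-w)*mean, and a kernel-weighted
# orientation histogram.
import numpy as np
import cv2

def edge_histogram(patch: np.ndarray, bins: int = 16, w: float = 0.5):
    gx = cv2.Sobel(patch, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(patch, cv2.CV_64F, 0, 1, ksize=3)
    mag = np.hypot(gx, gy)
    theta = np.arctan2(gy, gx)                     # gradient orientation
    th = w * mag.max() + (1 - w) * mag.mean()      # Eq. (13), w assumed
    theta = np.where(mag > th, theta, 0.0)         # Eq. (14): drop weak edges
    h, wd = patch.shape
    ys, xs = np.mgrid[0:h, 0:wd]
    r2 = ((ys - h / 2) / (h / 2)) ** 2 + ((xs - wd / 2) / (wd / 2)) ** 2
    k = np.clip(1.0 - r2, 0.0, None)               # Epanechnikov kernel
    idx = ((theta + np.pi) / (2 * np.pi) * bins).astype(int) % bins
    hist = np.bincount(idx.ravel(), weights=k.ravel(), minlength=bins)
    return hist / hist.sum()                       # Eq. (12)
```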

4.4. Relative discriminative coefficient based multi-cue fusion

When we use multiple cues for infrared pedestrian tracking, we must decide how to fuse the cues efficiently. Generally, there are two choices: fusing the features with constant weights, or fusing them with time-varying weights according to the change of tracking states. In practice, different features have different discriminative abilities even for the same target, and this ability may change over time. Constant weights may therefore lead to poor tracking performance, so the time-varying weighting idea is applied here. In the tracking process, we want features of better quality to receive larger weights, and vice versa.

The Bhattacharyya distance defined in (Bhattacharyya, 1943) reflects the similarity of the feature distributions between a particle sample and the reference model, and it is a popular technique for measuring a feature's performance in multi-cue based object tracking (Brasnett et al., 2007). In the single-cue case, particles with smaller Bhattacharyya distances approach the target more closely. However, this does not hold when multiple cues are applied, because different features have different sensitivities to changes in target appearance, and the Bhattacharyya distance cannot faithfully describe the similarity between particles and target. To evaluate a particle's quality directly, regardless of the feature space it belongs to, another distance, called the practical distance, is designed. Given two rectangles R1 and R2 centered at (x1, y1) and (x2, y2), as shown in Fig. 6(a), representing the reference target and a sample particle (target candidate) respectively, we define the practical distance as the degree of superposition between them. Generally, three factors should be considered: the center distance between the two rectangles (Fig. 6(a)), the pose angle difference between them (Fig. 6(b)), and the size (area) difference between them (Fig. 6(c)). For clarity, here we suppose the object does not rotate; then, when computing the practical distance, we only need to account for two factors, the position distance and the size difference. As shown in Fig. 6, this can be accomplished through Eq. (18) as:

d_p(R_1, R_2) = \sqrt{d_x^2 + d_y^2 + w_1 |h_2 - h_1| + h_2 |w_1 - w_2|} \qquad (18)

where d_x = x_1 − x_2 and d_y = y_1 − y_2 are the center offsets, and w_1, h_1, w_2, h_2 are the widths and heights of the two rectangles, respectively. In computing the practical distance, the feature space to which a particle belongs is not taken into account, so this distance gives the particle quality objectively across different feature spaces.

Fig. 6. Physical distance between two particles.
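Eq. (18) translates directly into a small helper, sketched below under the no-rotation assumption stated in the text.

```python
# Sketch of the practical distance of Eq. (18): center offset plus
# cross-weighted width/height differences between the reference
# rectangle and a particle rectangle.
import math

def practical_distance(r1, r2) -> float:
    """Each rectangle is (cx, cy, w, h): center coordinates, width, height."""
    x1, y1, w1, h1 = r1
    x2, y2, w2, h2 = r2
    dx, dy = x1 - x2, y1 - y2
    return math.sqrt(dx * dx + dy * dy
                     + w1 * abs(h2 - h1) + h2 * abs(w1 - w2))
```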

In order to estimate the tracking ability of the various cues, we first compute the Bhattacharyya distances in all feature spaces and the practical distance for each particle. Then, in each feature space, we rank all the particles in descending order of weight for that cue, and the first L particles are selected as training samples to evaluate the tracking ability. For each particle in the training samples, its performance ratio t in this feature space is computed as:

t_i = \frac{d_b^i(p, R)}{d_p^i(p, R)}, \quad i = 1, \ldots, L \qquad (19)

The mean performance ratio for this feature space is \bar{t} = \frac{1}{L} \sum_{i=1}^{L} t_i; \bar{t} indicates the cue's discriminative ability. To measure the difference among the cues, we define the Relative Discriminative Coefficient (RDC) of cue f as:

\mathrm{RDC}_f = \frac{\bar{t}_f}{\sum_{i=1}^{F} \bar{t}_i}, \quad f = 1, \ldots, F \qquad (20)

where F is the number of cues adopted.
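Eqs. (19) and (20) can be sketched as below; the array shapes and the ranking-by-weight step follow the description above, with L = 20 matching the training-sample count used in Section 5.

```python
# Sketch of Eqs. (19)-(20): performance ratios for the top-L particles of
# each cue, and the relative discriminative coefficients across all cues.
import numpy as np

def rdc_weights(d_b: np.ndarray, d_p: np.ndarray, weights: np.ndarray,
                L: int = 20) -> np.ndarray:
    """d_b: (F, N) Bhattacharyya distances per cue; d_p: (N,) practical
    distances; weights: (F, N) per-cue particle weights. Shapes assumed."""
    F = d_b.shape[0]
    t_bar = np.empty(F)
    for f in range(F):
        top = np.argsort(weights[f])[::-1][:L]      # best-L particles of cue f
        t_bar[f] = np.mean(d_b[f, top] / d_p[top])  # Eq. (19), averaged
    return t_bar / t_bar.sum()                      # Eq. (20)
```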

Our objective is to fuse multiple cues adaptively according to the current cue weights; this is achieved by generating the new particle weights as:


B^{(n)} = \prod_{f=1}^{F} \left( b_f^{(n)} \right)^{\mathrm{RDC}_f} \qquad (21)

Then the new weights are normalized to sum to 1:

B^{(n)} = B^{(n)} / \mathrm{sum}(B) \qquad (22)

Based on Eq. (22), all cues are combined into one by fusing the particle weights from the different feature spaces. This new weight is embedded in the particle filter tracking framework described in (Nummiaro et al., 2003) to locate the pedestrian. At the start frame, the reference models are initialized manually, and each cue weight is computed by evaluating the performance of the training samples. In the next frame, the initialized cue weights are used to estimate the target state in the particle filter framework; the reference models are then updated with the currently estimated target models, and the cue weights are updated based on the new reference models. Thus we update the reference models and cue weights at frame n − 1 and use them for pedestrian tracking at frame n (n > 1); by this operation, the different cues are weighted adaptively. The general algorithm is given as follows.

4.4.1. Initialization

For l = 1, ..., N, generate samples {x_0^{(l)}} from the prior distribution p(x_0). Initialize each feature's RDC {RDC_0^{(f)}} for f = 1, ..., F. Set the initial weights b_0^{l,f} = 1/N.

For t = 1 : T

  Resample: Select N states (they can repeat) from {x_{t-1}^{(l)}} by the following procedure:
    (1) For each particle, calculate the cumulative probability as c_t^0 = 0, c_t^l = c_t^{l-1} + b_t^l;
    (2) Generate a uniformly distributed random number r ∈ [0, 1] and find the smallest j for which c_t^j > r;
    (3) The selected state is \tilde{x}_{t-1}^l = x_{t-1}^j.

  Prediction:
    (1) Spread the states {\tilde{x}_{t-1}^i | i = 1, ..., N} to {x_t^i | i = 1, ..., N} using the state dynamic model.
    (2) For each new state x_t^i, find its corresponding weights in each feature space based on the likelihood models given in Section 4.3.
    (3) Normalize the weights as b_t^{l,f} = b_t^{l,f} / \sum_{h=1}^{N} b_t^{h,f}, for f = 1, ..., F.

  Multi-cue fusion: Combine the cues through B_t^l = \prod_{f=1}^{F} (b_t^{(l,f)})^{RDC_{t-1}^f}, and normalize as \hat{B}_t^l = B_t^l / \sum_{i=1}^{N} B_t^i.

  State estimation: X_t = \sum_{l=1}^{N} B_t^l x_t^l.

  Update the RDC: Update {RDC_t^{(f)}} for f = 1, ..., F based on \hat{X}_t.

End
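Putting the pieces together, a sketch of one iteration of this loop might look as follows. It composes the helpers sketched earlier (propagate, the cue histograms, rdc_weights); cue_likelihoods is a hypothetical stand-in for evaluating Eqs. (11) and (17) at every particle.

```python
# End-to-end sketch of the Section 4.4.1 loop: resampling, AR prediction,
# per-cue weighting, RDC-based fusion and state estimation.
import numpy as np

def track_frame(particles, weights_fused, rdc, frame, refs, rng):
    """particles: (N, 8); weights_fused: (N,); rdc: (F,); refs: reference
    histograms per cue. Returns the new state estimate and updated arrays."""
    N = particles.shape[0]
    # Resample: draw N indices proportional to the fused weights.
    idx = rng.choice(N, size=N, p=weights_fused)
    particles = propagate(particles[idx], rng)        # prediction, Eq. (8)
    # Per-cue likelihoods b_t^{l,f}, normalized over particles: shape (F, N).
    b = cue_likelihoods(frame, particles, refs)       # hypothetical helper
    b /= b.sum(axis=1, keepdims=True)
    # Multi-cue fusion, Eqs. (21)-(22): weighted geometric combination.
    fused = np.prod(b ** rdc[:, None], axis=0)
    fused /= fused.sum()
    estimate = fused @ particles                      # state estimate, Eq. (6)
    return estimate, particles, fused
```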

5. Infrared pedestrian tracking results

This section evaluates the performance of the proposed infrared pedestrian tracking approach on real-world video sequences obtained from the Object Tracking and Classification in and Beyond the Visible Spectrum (OTCBVS) benchmark dataset (Davis, 2007, http://www.cse.ohio-state.edu/otcbvs-bench/). In the tests, three very different tracking scenarios are considered. These experiments include a comparison of our method with three other trackers: two of them are particle filter trackers using the single intensity cue and the single edge cue respectively, and the third is the IVT (Incremental Visual Tracking) tracker (Ross David et al., 2008), which uses the intensity cue. We demonstrate how multi-cue tracking significantly outperforms tracking with any single feature. All of the tracking results are obtained using 200 particles, and rotational motion is ignored during tracking. When we use multiple cues for tracking, 20 particles are selected as training samples to weight the different cues. Implemented in the Matlab 7.0 environment, our algorithm runs at 3 to 7 frames/s on an AMD 2.0 GHz CPU. The tracking result is shown in the figures with a white bounding box. We present the three most difficult and representative video sequences in the following subsections.

5.1. Tracking a walking person under a good imaging situation

The first video sequence, shown in Fig. 7, is a person walking in a community environment. In this sequence the human body image is bright enough to discriminate from the background, and the only challenge is the rapid shape change of non-rigid human body motion. The tracking results show that all four trackers can locate the person; however, scale errors appear for the intensity, edge and IVT trackers after the person moves some distance. Compared with the single-cue trackers, our combined-cue tracker improves the tracking performance: it locates the person accurately in both position and scale.

Fig. 7. Tracking a person under a good imaging situation (from left to right: frames 1539, 1635, 1735 and 1917).

5.2. Tracking a moving person with intensity changes

In this sequence a man walks toward the camera from far to near. Over the walking process, both the intensity and the shape of the pedestrian region change greatly. Fig. 8 shows the performance of the four trackers. The particle filter with the intensity cue (Fig. 8(a)) and the IVT tracker (Fig. 8(c)) are distracted by background regions with similar intensity and lose the target quickly. Although the particle filter with the edge cue (Fig. 8(b)) can track the pedestrian successfully, it loses some scale information. Compared to the single-cue results, the particle filter with multiple cues fused by our proposed method (Fig. 8(d)) performs much better and tracks the pedestrian nicely in both position and scale.

Fig. 8. Tracking a person in an infrared sequence (from left to right: frames 2041, 2147, 2405 and 2715).

Fig. 9 shows the RDC curves for the intensity and gradient cues combined by the presented approach. It can be seen that the RDC of both cues varies over the tracking time and that, in general, the RDC of the gradient cue is larger than that of the intensity cue, which implies the gradient feature is more reliable than the intensity feature. This is supported by the observation that pedestrian tracking with gradient cues gives more accurate results than with intensity cues.

Fig. 9. RDC curves for intensity and edge cues.

5.3. Tracking a walking person under heavy disturbance

In Fig. 10, another difficult tracking situation is tested. In this test two men walk together from the left of the camera field to the right. During the walk they stay close to each other, and in some frames they conglutinate. The difficulty in this sequence lies in the fact that the tracking target is disturbed heavily by the other person in some frames; in this case common tracking techniques fail to find the correct target. Fig. 10 gives some tracking results for this test. Just as analyzed above, when a single feature model is used (Fig. 10(a), (b)), the trackers are distracted by the disturbing person and eventually fail to track the right target. Though the IVT tracker can successfully find the target (Fig. 10(c)), it loses some scale information. On the contrary, the tracker with adaptively combined cues stably follows the correct pedestrian target throughout the sequence. As shown in Fig. 10(d), our algorithm is robust to these difficult conditions.

Fig. 10. Tracking a person disturbed by another one (from left to right: frames 223, 293, 519 and 579).

6. Conclusion

This paper describes a shape based pedestrian detection approach and a multi-cue fusion algorithm that improve the performance of particle filter based infrared pedestrian tracking. To detect pedestrians in infrared images efficiently, a novel feature descriptor is introduced, and an SVM classifier is trained to identify the pedestrian targets. Based on this, a multi-cue fusion strategy is proposed for the pedestrian tracking task; in this fusion algorithm both the feature-space distance and the physical-space distance of particles are considered, and relative discriminative coefficients are defined to evaluate the tracking ability of the different cues. Compared with previous multi-cue fusion methods, the main strengths of our method are that, first, the tracking abilities of the different cues can be measured distinctly, and second, the method provides a tracking framework in which new cues can be added or removed flexibly. As demonstrated on various video sequences, our algorithm can track objects robustly. Future research directions include handling occlusions and tracking multiple infrared targets.

Acknowledgements

This work was supported by the National Science Foundation of China under Grant No. 60632050, the Natural Science Foundation of Anhui Province (10040606Q56), and the Provincial Natural Science Research Project of Anhui Colleges (KJ2010B185, KJ2011A252).

References

Bajracharya, M., Moghaddam, B., Howard, A., 2009. A fast stereo-based system for detecting and tracking pedestrians from a moving vehicle. Internat. J. Robotics Res. 28 (11–12), 1466–1485.

Bertozzi, M., Broggi, A., Caraffi, C., et al., 2007. Pedestrian detection by means of far-infrared stereo vision. Computer Vision and Image Understanding 106 (2–3), 194–204.

Bhattacharyya, A., 1943. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 35, 99–110.

Brasnett, P., Mihaylova, L., Bull, D., 2007. Sequential Monte Carlo tracking by fusing multiple cues in video sequences. Image Vision Comput. 25 (8), 1217–1227.

Conaire, C.O., O'Connor, N.E., Smeaton, A.F., 2008. Thermo-visual feature fusion for object tracking using multiple spatiogram trackers. Machine Vision Appl. 19 (5–6), 483–494.

Collins, R., Liu, Y., Leordeanu, M., 2005. Online selection of discriminative tracking features. IEEE Trans. Pattern Anal. Mach. Intell. 27 (10), 1631–1643.

Comaniciu, D., Ramesh, V., Meer, P., 2000. Real-time tracking of non-rigid objects using mean shift. In: Proc. Conf. on Computer Vision and Pattern Recognition, pp. 142–149.

Comaniciu, D., Ramesh, V., Meer, P., 2003. Kernel-based object tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25 (5), 564–577.

Dai, C.X., Zheng, Y.F., Li, X., 2007. Pedestrian detection and tracking in infrared imagery using shape and appearance. Computer Vision and Image Understanding 106 (2–3), 288–299.

Davis, J., 2007. OTCBVS benchmark dataset collection. http://www.cse.ohio-state.edu/otcbvs-bench/.

Enzweiler, M., Gavrila, D.M., 2009. Monocular pedestrian detection: Survey and experiments. IEEE Trans. Pattern Anal. Mach. Intell. 31 (12), 2179–2195.

Fang, Y.J., Yamada, K., Ninomiya, Y., et al., 2004. A shape-independent method for pedestrian detection with far-infrared images. IEEE Trans. Veh. Technol. 53 (6), 1679–1697.

Gordon, N., Salmond, D., Smith, A., 1993. A novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F 140 (2), 107–113.



Han, B., Davis, L., 2005. Robust observations for object tracking. In: Proc. 12th Internat. Conf. on Image Processing, 2, pp. 442–445.

Isard, M., Blake, A., 1998. Condensation – conditional density propagation for visual tracking. Internat. J. Comput. Vision 28 (1), 5–28.

Jepson, A., Fleet, D., El-Maraghi, T., 2003. Robust online appearance models for visual tracking. IEEE Trans. Pattern Anal. Mach. Intell. 25 (10), 1296–1311.

Li, P., Francois, C., 2004. Image cues fusion for object tracking based on particle filter. Lect. Notes Comput. Sci. 3179, 99–107.

Loy, G., Fletcher, L., Apostoloff, N., 2002. An adaptive fusion architecture for target tracking. In: Proc. Fifth IEEE Internat. Conf. on Automatic Face and Gesture Recognition, pp. 261–266.

Nummiaro, K., Koller-Meier, E., Gool, L.V., 2003. An adaptive color-based particle filter. Image Vision Comput. 21 (1), 99–110.

Pérez, P., Vermaak, J., Blake, A., 2004. Data fusion for tracking with particles. Proc. IEEE 92 (3), 495–513.

Ross, D.A., Lim, J.W., Lin, R., et al., 2008. Incremental learning for robust visual tracking. Internat. J. Comput. Vision 77 (1–3), 125–141.

Shen, C., van den Hengel, A., Dick, A., 2003. Probabilistic multiple cue integration for particle filter based tracking. In: Proc. VIIth Digital Image Computing: Techniques and Applications, pp. 399–408.

Spengler, M., Schiele, B., 2003. Towards robust multi-cue integration for visual tracking. Machine Vision Appl. 14 (1), 50–58.

Stauffer, C., Grimson, W., 1999. Adaptive background mixture models for real-time tracking. In: Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 2, pp. 246–252.

Triesch, J., von der Malsburg, C., 2001. Democratic integration: Self-organized integration of adaptive cues. Neural Comput. 13 (9), 2049–2074.

Wang, H., David, S., 2006. Efficient visual tracking by probabilistic fusion of multiple cues. In: Proc. 18th Internat. Conf. on Pattern Recognition, 4, pp. 892–895.

Wu, Y., Huang, T.S., 2004. Robust visual tracking by integrating multiple cues based on co-inference learning. Internat. J. Comput. Vision 58 (1), 55–71.

Yang, C., Duraiswami, R., Davis, L., 2005. Efficient mean-shift tracking via a new similarity measure. In: Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 176–183.


