
1096 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 8, AUGUST 2008

Person Surveillance Using Visual and Infrared Imagery

Stephen J. Krotosky and Mohan Manubhai Trivedi

Abstract—This paper presents a methodology for analyzing multimodal and multiperspective systems for person surveillance. Using an experimental testbed consisting of two color and two infrared cameras, we can accurately register the color and infrared imagery for any general scene configuration, expanding the scope of multispectral analysis beyond the specialized long-range surveillance experiments of previous approaches to more general scene configurations common to unimodal approaches. We design an algorithmic framework for detecting people in a scene that can be generalized to include color, infrared, and/or disparity features. Using a combination of a histogram of oriented gradient (HOG) feature-based support vector machine and size/depth-based constraints, we create a probabilistic score for evaluating the presence of people. Using this framework, we train person detectors using color stereo and infrared stereo features as well as tetravision-based detectors that combine the detector outputs from separately trained color stereo and infrared stereo-based detectors. Additionally, we incorporate the trifocal tensor in order to combine the color and infrared features in a unified detection framework and use these trained detectors for an experimental evaluation of video sequences captured with our designed testbed. Our evaluation definitively demonstrates the performance gains achievable when using the trifocal framework to combine color and infrared features in a unified framework. Both of the trifocal setups outperform their unimodal equivalents, as well as the tetravision-based analysis. Our experiments also demonstrate how the trained detector generalizes well to different scenes and can provide robust input to an additional tracking framework.

I. INTRODUCTION

THIS paper presents a methodology for analyzing multimodal and multiperspective systems for person surveillance. Using an experimental testbed consisting of two color and two infrared cameras, we can accurately register the color and infrared imagery for any general scene configuration, so the scope of multispectral analysis can be expanded beyond the specialized long-range surveillance experiments of previous approaches to more general scene configurations common to unimodal approaches.

We design an algorithmic framework for detecting people in a scene that can be generalized to include color, infrared, and/or disparity features. Using a combination of histogram of oriented gradient (HOG) features from the color and infrared domains,

Manuscript received October 28, 2007; revised March 9, 2008. First published July 9, 2008; current version published August 29, 2008. This work was supported by the Technical Support Working Group and the U.C. Discovery Grant. This paper was recommended by Guest Editor Z. He.

S. J. Krotosky is with the Advanced Multimedia and Signal Processing Division, Science Applications International Corporation (SAIC), San Diego, CA 92121 USA (e-mail: [email protected]).

M. M. Trivedi is with the Computer Vision and Robotics Research Laboratory (CVRR), University of California at San Diego, La Jolla, CA 92093-0434 USA.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2008.928217

we train a support vector machine (SVM) to detect people in the scene. Additionally, we learn the relationship between person size and depth in the scene to create a disparity-based detector. We assume that the visual and disparity trained detectors can be treated independently and probabilistically combine their outputs to create an overall detection score.

Within this framework, we train person detectors using color stereo and infrared stereo features. We also analyze tetravision-based detectors that combine the detector outputs from separately trained color stereo and infrared stereo features. Additionally, we incorporate the trifocal tensor in order to combine the color and infrared features in a unified detection framework, doing so for both the color stereo + single infrared and infrared stereo + single color cases. We use these trained detectors for an experimental evaluation of video sequences captured with our designed testbed.

Our evaluation definitively demonstrates the performance gains achievable when using the trifocal framework to combine color and infrared features in a unified framework. Both of the trifocal setups outperform their unimodal equivalents, as well as the tetravision-based analysis. Our experiments also demonstrate how the trained detector generalizes well to different scenes and can provide robust input to an additional tracking framework.

II. RELATED RESEARCH

Person analysis in multispectral and multiperspective imagery is a relatively new area of research in computer vision. Analysis that incorporates a comparison between color and infrared imagery for person analysis has been relatively sparse and limited in scope and generality.

Typical studies have looked at person detection by treating color and infrared separately. For example, Zhang et al. [1] compared different image features in color and infrared monocular imagery for training an SVM. However, no direct comparison of the detection rates of color and thermal imagery was presented. Ran et al. [2] also looked at separately using color and thermal imagery to detect periodic motion that indicates pedestrians in the scene. The main goal of these studies is to show the extensibility and adaptation of color image analysis techniques on infrared imagery.

Other studies have examined person detection as a fusion of color and infrared imagery. Davis and Sharma [3] have constructed a data set of color and infrared videos. The data set provides for a frame-by-frame comparison of the color and infrared imagery and allows for the registration of the two videos as the view conforms to a planar homography assumption. This data set has allowed for the development of algorithms that combine the color and thermal imagery for improved background subtraction [4], [5] and person detection and tracking [6].

1051-8215/$25.00 © 2008 IEEE


KROTOSKY AND TRIVEDI: PERSON SURVEILLANCE USING VISUAL AND INFRARED IMAGERY 1097

The planar homographic assumption is a convenient way to register the color and infrared imagery. However, the assumption also severely limits the types of scenes that can be analyzed with multiperspective imagery. Because it is assumed that all objects can be aligned with a single planar homographic transformation, the scene must be in a special configuration to satisfy this assumption. Typically, this means that all objects are sufficiently far from the camera that they satisfy the infinite homography assumption. While this provides a method of registration, other scenes where people can be at multiple distances from the camera, such as those commonly analyzed in monocular and stereo imagery, cannot be analyzed in the planar homographic framework. Our previous studies [7] have elucidated the ways that multispectral and multiperspective data can be registered and have shown ways to register color and infrared imagery for any general scene configuration.

The most effective way to register color and thermal imagery for any general scene context is to incorporate stereo imagery, whose depth estimates can account for the parallax inherent in any multiperspective scene. Bertozzi et al. [8], [9] have designed a four-camera "tetravision" system to analyze people in color and infrared stereo imagery. Detection is performed separately in the color-stereo and infrared-stereo domains. The detection results are then fused by associating the detected bounding boxes from each modality based on their 3-D location in the scene.

We have introduced a trifocal approach to person detection with color and thermal imagery [10]. By incorporating stereo depth estimates from a single modality, we can register the second modality accurately using the trifocal tensor. This paper extends the multispectral framework proposed in [10], improving the method for combining the color and infrared features and establishing a unified multispectral person detector that is used to identify people in novel imagery. Our results indicate a significant improvement in accuracy and robustness over unimodal detection frameworks, as well as over state-of-the-art multispectral detection frameworks.

III. TRIFOCAL TENSOR VERSUS HOMOGRAPHY

The trifocal approach to combining color and infrared imagery allows us to compare a wider range of data than has previously been analyzed in the literature. Typical approaches that combine color and infrared for analyzing pedestrians focus on scenes where objects appear very far away from the camera, such as those from the IEEE OTCBVS WS Series Bench [3]. This allows for straightforward registration using a planar homography assumption, yet limits the depths of field that can be analyzed. This means that analysis must be confined to a restricted plane-of-interest in the scene or the cameras must be placed to ensure that all areas in the scene will comply with the homography.

Fig. 1 shows the differences in fields-of-view between (a) a typical planar homographic approach and (b) our trifocal approach to color and infrared analysis. Notice how people in the planar homographic imagery are all very far away from the camera and at a similar scale. This is a requirement of the planar homographic approach and puts severe limits on the types of scenes that can be fully analyzed within this framework. The

Fig. 1. Comparison of viable fields of view for combining color and infrared imagery for (a) a planar homography and (b) our trifocal approach.

typical data from the trifocal framework are much more general and complex. People can be at a broad range of scales and distances from the cameras. As long as depth estimates for an image region can be obtained, we can register the relevant pixels of objects at any general position in the scene.

Fig. 2 illustrates the large range of scales that can be obtained in the trifocal framework. It is a challenge to design a person classifier that is able to handle such a broad range of scales, as the extracted features need to be relatively invariant to these scale changes. The incorporation of this scale range also greatly increases the number of candidates to consider, thereby increasing the potential for false positives.

IV. ALGORITHMIC FRAMEWORK

We wish to explore how the incorporation of color, infrared, and disparity features affects the classification and false positive rates of a person detection system. To do so, we establish a framework for registering the multimodal imagery and extracting features from this imagery that can be used to learn to detect people in a scene. Fig. 3 shows the algorithmic flow of this framework. We describe the details of our framework in the following sections.

A. Image Registration With Trifocal Tensor

We use a three-camera approach, consisting of a unimodal stereo pair (color or infrared) combined with a single camera of the second modality. We use the disparity estimates from the stereo imagery to register corresponding pixels in the third image with the trifocal tensor, the set of matrices relating the correspondences between the three images.

The trifocal tensor can be estimated by minimizing the algebraic error of point correspondences [11]. Point correspondences can be obtained for trifocal imagery using the same calibration techniques used for stereo calibration, where the calibration board is visible in each trifocal image. While only seven point–point–point correspondences are required to compute the trifocal tensor, in practice we use many more correspondences to smooth errors in the point estimates. The resulting trifocal tensor is written as T = [T_1, T_2, T_3], where T_i is a 3 x 3 matrix for the ith image in the set. From this tensor notation, standard



Fig. 2. Range of scales at which people can be seen in the trifocal framework.

Fig. 3. Algorithmic framework for person detection with color, infrared, and disparity image features.

two-view geometry parameters, such as the fundamental matrices, epipoles, and projection matrices, can be determined. Additionally, given a point correspondence x <-> x', we can estimate the point transfer to the third image point x'' as

x''^k = x^i l'_j T_i^{jk}    (1)

where summation over repeated indices is implied and l' is a line in the second image passing through x'. Dense stereo matching gives correspondences x <-> x', and the point transfer to the third image is estimated and registered to the reference image. Fig. 4 shows an example set of registered trifocal imagery.
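To make the transfer in (1) concrete, the following is a minimal sketch, assuming (as an illustration, not the authors' implementation) that the tensor is stored as a 3 x 3 x 3 NumPy array T[i, j, k] and that a line l' through the matched point in the second image has already been chosen, e.g., perpendicular to the epipolar line of x:

```python
import numpy as np

def transfer_point(T, x, l_prime):
    """Point transfer of (1): x''^k = x^i * l'_j * T[i, j, k].

    T       : (3, 3, 3) array holding the trifocal tensor
    x       : homogeneous point in the first (reference) image
    l_prime : a line through the matched point in the second image
    Returns the transferred homogeneous point in the third image.
    """
    x3 = np.einsum('i,j,ijk->k', np.asarray(x, float),
                   np.asarray(l_prime, float), T)
    return x3 / x3[2]  # normalize the homogeneous result
```

Applying this per pixel, with disparities supplying the correspondence x <-> x', registers the third image to the reference view.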

B. Annotation

Now that we are able to accurately register all three modalities, we can extract positive and negative samples for classification. Positive samples need to be annotated from video sequences. Bounding boxes for all people in the scene were annotated. For consistency in classification, all bounding boxes maintain a 2:5 aspect ratio. Negative samples can then be generated by translating the corresponding bounding box for a person to a nonperson region in the scene. Additional negative samples are generated by selecting smaller subregions of the selected pedestrian region. Although annotation needs to be done once

Fig. 4. Examples of using the trifocal tensor to register a third image to a stereo pair. The left column shows an infrared image registered to a color stereo pair, and the right column shows a color image registered to an infrared stereo pair.

for a single trifocal setup, it is necessary to repeat the annotation for both the color stereo and infrared stereo frameworks, as they have slightly different fields-of-view. Care was taken to ensure that samples were annotated from identical people at identical frames in both cases to limit variability in the training. Additionally, only nonoccluded pedestrians were included in the training set. We expect the classifier will still be able to handle occlusion without explicitly training for it, and our experimental evaluation will validate this assumption. Fig. 5 shows example positive samples annotated in color and infrared reference imagery. For each sample, we can simultaneously extract the reference image patch, its disparity image, and the reprojected image data to create the combined sample triplet.
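The negative-sample generation described above can be sketched as follows; the box representation, retry limit, and minimum width are illustrative assumptions, not values from the paper:

```python
import random

ASPECT = 2 / 5  # width : height ratio kept for all boxes (2:5, as in the text)

def sample_negatives(pos_box, person_boxes, img_w, img_h,
                     n_translate=10, n_sub=5, max_tries=100, min_w=6):
    """Generate negative boxes from one annotated positive box (x, y, w, h)."""
    def overlaps_person(box):
        bx, by, bw, bh = box
        return any(bx < px + pw and px < bx + bw and
                   by < py + ph and py < by + bh
                   for px, py, pw, ph in person_boxes)

    x, y, w, h = pos_box
    negatives = []
    # 1) translate the box to random non-person regions of the frame;
    #    give up on a sample after max_tries failures (dense scene)
    for _ in range(n_translate):
        for _ in range(max_tries):
            cand = (random.randrange(img_w - w),
                    random.randrange(img_h - h), w, h)
            if not overlaps_person(cand):
                negatives.append(cand)
                break
    # 2) smaller subregions of the positive box, same aspect ratio
    for _ in range(n_sub):
        if w <= min_w:
            break  # subregions would fall below the smallest person scale
        sw = random.randrange(min_w, w)
        sh = min(int(sw / ASPECT), h)
        negatives.append((x + random.randrange(w - sw + 1),
                          y + random.randrange(h - sh + 1), sw, sh))
    return negatives
```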

C. Image Features

Once annotated, we must extract features that will be able to differentiate the positive and negative samples. For the color and infrared images, we elect to extract HOG features similar to those proposed by Dalal and Triggs [12]. These features attempt to encode the relevance of edges in terms of their orientation



Fig. 5. Example positive samples of people extracted from (a) color stereo reference images and (b) infrared stereo reference images. The top row shows the reference sample, the middle row shows the disparity sample, and the bottom row shows the reprojected image sample.

and spatial position and have been increasingly utilized in many recent person classification publications [1], [13]. We resize each of the color and infrared image samples to a common size and compute an (n_w x n_h x n_o)-element histogram, where n_w, n_h, and n_o are the numbers of histogram bins in width, height, and gradient orientation, respectively.
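A minimal sketch of such a binned orientation histogram follows, using the 6 x 15 x 8 bin configuration reported later for training; the paper's exact cell layout and block normalization (per Dalal and Triggs) may differ:

```python
import numpy as np

def hog_feature(patch, n_w=6, n_h=15, n_o=8):
    """Gradient-orientation histogram over an n_h x n_w spatial grid,
    in the spirit of Dalal-Triggs HOG (block normalization omitted).

    patch : 2-D grayscale array, e.g., a sample resized to 24 x 60
    Returns a flattened (n_w * n_h * n_o)-element feature vector.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                 # unsigned orientation
    o = np.minimum((ang / np.pi * n_o).astype(int), n_o - 1)
    H, W = patch.shape
    hist = np.zeros((n_h, n_w, n_o))
    for i in range(H):
        for j in range(W):
            hist[i * n_h // H, j * n_w // W, o[i, j]] += mag[i, j]
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-9)            # global L2 normalization
```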

For the disparity image, we initially considered extracting HOG features as well. Our initial results indicated this was valid, as the ROC curves showed that the classifier trained on color, infrared, and disparity HOG features outperformed those trained on just color and infrared. ROC curves were generated by training an SVM using 1654 positive samples and 22 520 negative samples in a 90/10 cross-validation framework. Fig. 6 shows the ROC curves for classifiers trained on variations of color, infrared, and disparity HOG features. Fig. 6(a) shows the ROC curves for the color stereo reference, and Fig. 6(b) shows the ROC curve for the infrared stereo reference. The combination of color, infrared, and disparity performs best when evaluating cropped samples in a cross-validation framework. This is a misleading result, though, as the ROC is constructed only by classifying annotated image patches. When the same classifier is applied to find people in novel images, the resulting regions include many false positives, often more than the number of people in the scene.

This performance drop-off is likely due to a combination of factors. First, it is likely that the HOG features are inappropriate for disparity imagery. They are designed to capture edge properties of an image patch, yet, for many positive samples of people, there are few to no edges in the disparity image. This is especially true for people close to the background, where their difference in disparity from the background is small. While adding these HOG disparity features can provide some additional differentiability when classifying the carefully cropped and annotated image patches, the features actually give false positives when classifying novel images, especially at regions near true person regions.

To find an alternative, we further examine the disparity imagery to find features that help to differentiate people from the

Fig. 6. ROC curves showing the combination of color, disparity, and infrared features when using HOG features for all modalities.

Fig. 7. Linear relation of bounding box height and median disparity for positive samples of people. The data points are plotted in blue and the least-squares linear fit is plotted in red.

background and other objects in the scene. Using the knowledge that a person's size is normally distributed around a mean value, we model the linear correlation between the size of the bounding box that encloses the person and the median of the disparity inside that region. Fig. 7 shows the relationship between the bounding box height and the median disparity and the least-squares linear fit of the data.

This line is parameterized as h = a d + b, where (a, b) are the parameters of the line, h is the bounding box height, and d is the median disparity. For a candidate bounding box, we can then compute its distance from this ideal line as

D = |a d + b - h| / sqrt(a^2 + 1).    (2)
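The fit and the distance test can be sketched as follows; the point-to-line form of (2) is one plausible reconstruction, since the original equation symbols were lost in extraction:

```python
import numpy as np

def fit_height_disparity(heights, disparities):
    """Least-squares fit of h ~ a*d + b (box height vs. median disparity)."""
    a, b = np.polyfit(disparities, heights, 1)
    return a, b

def line_distance(height, disparity, a, b):
    """Perpendicular distance of a candidate (d, h) from the fitted line,
    one plausible form of the distance D in (2)."""
    return abs(a * disparity + b - height) / np.hypot(a, 1.0)
```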

D. Learning and Classification

While the single disparity feature could potentially be added to the color and infrared HOG features, it is likely that its power for classification would be lost among the hundreds of HOG features generated for each sample. Since D arises from a different modality and attempts to model a completely different physical property, it is appropriate to treat these features independently. We build one classifier for the visual HOG features and another classifier for the disparity feature. Each classifier's result can then be probabilistically combined to determine the final classification.

We train a person classifier using the HOG features from the color and/or infrared imagery using an SVM with radial basis function kernels [14]. We use cross validation during training to give probability estimates that a bounding box contains a person, P_HOG.

We model the disparity-based classification as being normally distributed around the distance D from the line learned in Fig. 7. We compute the probability that a region contains a person given the distance D as

P_disp = erfc(D / (sigma * sqrt(2)))    (3)

where erfc is the complementary error function and sigma is the standard deviation control parameter of the modeled Gaussian. sigma is chosen such that 90% of the training samples lie within one standard deviation of the mean. We then set the threshold so that the discrimination of the disparity-based classifier is relatively weak, but filters out many areas that the HOG-based classifier will not have to evaluate.

By making an independence assumption, we can construct the final classification probability as

P = P_HOG * P_disp.    (4)

The additional benefit of using these features in two separate classifiers is that the relatively fast disparity classifier can be used to reduce the number of bounding boxes that need to be evaluated by the slower HOG-based classifier. In practice, we have found that the number of candidate evaluations can be reduced by two orders of magnitude by only considering bounding boxes with high probability from the disparity classifier.
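The scoring in (3) and (4) and the disparity-first pruning can be sketched as follows; the threshold value and function names are illustrative assumptions, and (3) is reconstructed from the stated use of the complementary error function:

```python
from math import erfc, sqrt

def disparity_probability(D, sigma):
    """One plausible reading of (3): the probability mass of a zero-mean
    Gaussian with std sigma lying farther from the fitted line than D."""
    return erfc(D / (sigma * sqrt(2)))

def combined_score(p_hog, p_disp):
    """Eq. (4) under the independence assumption: a simple product."""
    return p_hog * p_disp

def detect(candidates, p_disp_fn, p_hog_fn, disp_threshold=0.5):
    """Cheap disparity test first; the slower HOG SVM runs only on survivors."""
    scores = {}
    for box in candidates:
        p_d = p_disp_fn(box)
        if p_d < disp_threshold:  # weak filter, as described in the text
            continue
        scores[box] = combined_score(p_hog_fn(box), p_d)
    return scores
```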

V. EXPERIMENTAL FRAMEWORK

A. Experimental Testbed and Image Acquisition

We need to establish a framework for experimenting and analyzing person surveillance detection approaches that will facilitate a direct, frame-by-frame comparison of the various approaches that combine color and infrared stereo imagery. We

Fig. 8. Experimental testbed. Two color cameras and two infrared cameras arranged in stereo pairs and mounted to the front of the LISA-P testbed.

designed a custom rig, shown in Fig. 8, consisting of a matched color stereo pair and a matched infrared stereo pair. The two pairs share identical baselines and have been aligned in pitch, roll, and yaw to maximize the similarities in field of view. Such a rig will allow us to compare Color Stereo, Infrared Stereo, Trifocal Color Stereo + Infrared (CSI), Trifocal Infrared Stereo + Color (ISC), and Tetravision approaches to person detection. A four-input video capture card is used to acquire the images, and a time-stamping synchronization routine is used to best align the asynchronously captured video sequences.

Calibration data were obtained by illuminating a checkerboard pattern with high-intensity halogen bulbs so the checks would be visible in both color and infrared imagery and standard calibration techniques could be applied to obtain the intrinsic and extrinsic parameters of the cameras. Color stereo and infrared stereo calibration was obtained from the matched calibration points using the Matlab Camera Calibration Toolbox [15]. The same point sets are also used to estimate the trifocal tensor for the trifocal CSI and ISC cases.

B. Data Set and Training

Videos were collected over several days in the scene shown in Fig. 4. Twenty-one sequences of 352 x 240, 30-fps video were collected of different people moving throughout the environment at different times of day in an attempt to capture a wide range of illumination, position, occlusion, and density conditions. Of those sequences, 19 were used for annotation and training, while the remaining two were reserved as test sets. The two separate test videos were selected for their challenging and dense scenes and the fact that the people in the scene were not in the other videos. Cross validation was not used in these experiments, as the resulting detections were evaluated by a human operator and increasing the number of test sequences would make the evaluation unmanageable. For each sequence, we compute color stereo, trifocal CSI, infrared stereo, and trifocal ISC variants of the original data using the dense disparity generation described in [16] with the trifocal tensor.

1) Annotation of Color Stereo and Trifocal CSI Data: When using the color stereo as the base reference image, we annotated 1654 positive samples of people in the scene. The positive samples range over 21 scales from 6 to 46 pixels wide. For each positive sample, we attempt to obtain ten negative samples by randomly translating the bounding box to a nonperson region in the scene. We also obtain an additional five negative samples by randomly selecting a subregion of the positive bounding box



as a negative sample. If a negative sample cannot be generated after a maximum number of iterations due to a dense scene, or if the subregion is smaller than the smallest person scale, we do not include that negative sample. In all, 22 520 negative samples were gathered for training.

2) Annotation of Infrared Stereo and Trifocal ISC Data: We made every attempt to include identical samples when using the infrared stereo as reference as in the color stereo case. However, due to the slightly different fields of view, this was not always possible. Positive and negative samples were generated in the same manner as in the color stereo case, resulting in 1425 positive and 19 533 negative samples. We maintain the same scale range of 6–46 pixels in bounding box width.

For training, the color and infrared parts of each sample are resized to 24 × 60 pixels. A 6 × 15 × 8-dimensional HOG feature is computed for each of the color and infrared parts of the sample and used to train SVM classifiers with radial basis function (RBF) kernels. We use cross validation of the training samples to obtain probability estimates for the classifiers. SVM classifiers are obtained for each of the four combinations of color and infrared imagery.
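A minimal sketch of the HOG feature geometry follows, assuming 4 × 4-pixel cells so that a 24 × 60 patch yields a 6 × 15 grid of cells with 8 orientation bins each (720 values). This simplified version omits the block normalization of the full Dalal–Triggs descriptor and is our illustration, not the authors' implementation.

```python
import math

def hog_feature(patch, cell=4, bins=8):
    """Simplified HOG: per-cell histograms of gradient orientation.

    `patch` is a 2D list (rows of grayscale values), e.g. 60 rows by
    24 columns, giving a 15 x 6 grid of cells with `bins` orientations.
    """
    h, w = len(patch), len(patch[0])
    gh, gw = h // cell, w // cell
    hist = [[[0.0] * bins for _ in range(gw)] for _ in range(gh)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central-difference gradients.
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.atan2(gy, gx) % math.pi  # unsigned orientation
            b = min(int(ang / math.pi * bins), bins - 1)
            # Magnitude-weighted vote into the cell's orientation bin.
            hist[y // cell][x // cell][b] += mag
    # Flatten to a single feature vector of gh * gw * bins values.
    return [v for row in hist for c in row for v in c]
```

The flattened vector would then be fed to an RBF-kernel SVM (e.g., via LIBSVM with probability outputs enabled).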

The training data are also used to learn the bounding box height-to-disparity function used to classify people in the disparity domain. We obtain a linear estimate of the function for both the color stereo and infrared stereo cases.
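The linear height-to-disparity estimate amounts to an ordinary least-squares fit over the annotated samples. A sketch under our assumptions (function names and the 30% tolerance are illustrative, not from the paper):

```python
def fit_linear(disparities, heights):
    """Least-squares fit: heights ~= a * disparity + b."""
    n = len(disparities)
    sx = sum(disparities)
    sy = sum(heights)
    sxx = sum(d * d for d in disparities)
    sxy = sum(d * h for d, h in zip(disparities, heights))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def plausible_person(box_height, disparity, a, b, tol=0.3):
    """Accept a candidate whose height is near the predicted height
    for its disparity; reject size/depth-inconsistent candidates."""
    predicted = a * disparity + b
    return abs(box_height - predicted) <= tol * predicted
```

Because the fit is linear with only two parameters, retraining for a new camera perspective needs just a handful of annotated examples, as noted later in Section V-C.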

V. EXPERIMENTAL EVALUATION

We analyze the reserved test sequences from the 21 training sequences. The sequences include various people moving through the scene along with other moving objects, including vehicles and a dog. A detection was counted as a success if an appropriately sized bounding box encapsulated the person in the scene. Conversely, false positives arose when bounding boxes did not encapsulate a person region, and missed detections occurred when a person was not found by the classifier. All experiments were performed offline. To achieve real-time analysis, a real-time capable SVM would be required, as the SVM is the speed bottleneck of the system; the stereo evaluation and HOG feature extraction are already real-time capable.

A. Comparison

The sequences were evaluated for the color stereo, trifocal CSI, infrared stereo, and trifocal ISC. Additionally, we compare our trifocal approaches to the tetravision approach proposed by Bertozzi et al. [8]. The tetravision approach utilizes a four-camera rig where color stereo and infrared stereo are analyzed independently and their detections combined to determine the overall detection. We apply this philosophy in our analysis by combining the results from our color stereo and infrared stereo using logical AND and OR operations on the bounding boxes.
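The AND/OR merge of the two modalities' bounding boxes can be sketched as below. The overlap criterion (IoU ≥ 0.5) and function names are our assumptions for illustration, not the exact rule of [8].

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax0, ay0, aw, ah = a
    bx0, by0, bw, bh = b
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax0 + aw, bx0 + bw), min(ay0 + ah, by0 + bh)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def combine_detections(color_boxes, infrared_boxes, mode="AND", thresh=0.5):
    """Tetravision-style merge of independently obtained detections.

    AND keeps color boxes confirmed by an overlapping infrared box;
    OR keeps the union of both detection sets (deduplicated by overlap).
    """
    if mode == "OR":
        merged = list(color_boxes)
        for b in infrared_boxes:
            if all(iou(b, c) < thresh for c in merged):
                merged.append(b)
        return merged
    # AND: keep color boxes that overlap some infrared box.
    return [c for c in color_boxes
            if any(iou(c, b) >= thresh for b in infrared_boxes)]
```

The AND rule trades missed detections for fewer false positives; the OR rule does the opposite, which is why both variants are evaluated.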

Fig. 9 shows the ROC curves for a sampled portion of the entire sequence. Plotted data points were generated by analyzing each classifier's detection/false positive rate when the detection probability threshold was set at 80%, 85%, 90%, and 95%. Fig. 10 shows example results of person detection using each

Fig. 9. ROC curve of person detection using color/infrared SVM with disparity-based classifiers.

of the compared approaches. In these examples, the detection probability threshold was fixed at 90%.
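Generating the ROC points from scored candidate windows is straightforward; a sketch (with hypothetical names, assuming each candidate carries a probability score and a ground-truth label):

```python
def roc_points(scores_and_labels, n_frames,
               thresholds=(0.80, 0.85, 0.90, 0.95)):
    """Compute ROC operating points from scored candidates.

    `scores_and_labels` is a list of (probability, is_person) pairs for
    every evaluated candidate window. Returns a list of
    (threshold, detection_rate, false_positives_per_frame) tuples.
    """
    n_pos = sum(1 for _, is_p in scores_and_labels if is_p)
    points = []
    for t in thresholds:
        tp = sum(1 for s, is_p in scores_and_labels if is_p and s >= t)
        fp = sum(1 for s, is_p in scores_and_labels if not is_p and s >= t)
        points.append((t, tp / n_pos, fp / n_frames))
    return points
```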

Clearly, the two trifocal classifiers outperform the single-modality classifiers by a large margin. For a false positive rate of one per frame, the multimodal classifiers increase the detection rate by over 45%, from 0.65 to almost 0.95. These are impressive gains, and, while different feature selections and data profiles could yield more modest gains, there is clearly a substantial benefit in incorporating color and infrared features to create a superior discriminator of people in a scene. It is also clear that incorporating the color and infrared features for classification in this trifocal approach is better suited to detecting pedestrians than the independent classification and merge philosophy of the tetravision approach. Again, for a false positive rate of one per frame, we see an increase in detection of almost 20%. Incorporating the multispectral features at the classifier level yields much more accurate detection than combining the detection results independently.

Of note is that color stereo is outperformed by infrared stereo, yet trifocal CSI performs better than trifocal ISC. While this may seem counterintuitive, a careful analysis can illuminate the cause for this swap in comparative performance. Since we use the stereo disparities to initially thin the person candidates in the scene, it is not surprising to see the infrared stereo outperform the color stereo case. The disparity generation algorithm we used [16] relies on windowed correlation matching, where highly textured areas are more easily matched than areas of low texture. Since infrared imagery is inherently low textured, the nonperson regions produce fewer valid disparities, resulting in fewer false positives. Similarly, the areas that contain people are likely to have valid disparity estimates and will stand out in the imagery. The color imagery is often similarly textured in person and nonperson regions, so more candidates are generated, increasing the potential for false positives.

However, the opposite is true when both the color and infrared are used in the SVM classifier. In this case, the increased disparity resolution and accuracy from the color stereo imagery allows for more accurate trifocal registration and improves the selection of candidate person regions. This improved fidelity makes it easier for the color and infrared trained SVM to differentiate person and nonperson regions and yields the higher performance shown in the ROC curves. We also expect that, when the trifocal registered modality is not available (e.g., color


1102 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 8, AUGUST 2008

Fig. 10. Example results of a frame-by-frame comparison of the person detection results using different combinations of color, infrared, and disparity features. Successful detections are shown in red, false positives in yellow.

Fig. 11. Good results for trifocal ISC.

at night or infrared during poor temperature conditions), the detection rates will move towards their unimodal counterparts.

B. Extended Analysis of Trifocal Detectors

We further focus our analysis on the top performing classifiers. Fig. 11 shows successful detection results for example frames using the trifocal ISC framework. Fig. 12 shows examples of successful detection for example frames using the best performing trifocal CSI framework. Notice how the framework yields accurate detection across a wide range of person scales, from people very large and near in the foreground to barely visible people deep in the background. These figures also demonstrate the classifiers' ability to suppress false positives from other objects in the scene, including vehicles and dogs.

Table I expands the analysis of the best performing trifocal CSI classifier by including several additional analyzed video sequences. The resulting analysis reinforces the results of the comparative analysis in Fig. 9, showing an overall detection rate of 92.15% with 0.606 false positives per frame. This consistency further emphasizes the benefits of utilizing the trifocal CSI framework.

While the resulting detection rate is relatively high, we also incur a seemingly high false positive rate of 0.606 false positives per frame (FPP). However, the SVM was trained

to minimize the number of false positives per evaluated candidate window sample (FPW). For each frame, we evaluate 352 × 240 × 21 windows in the image, meaning our FPW is 3.4 × 10⁻⁷. Fig. 13 shows examples of the false positives generated in our detection framework. The false positives in the images are shown in yellow. Our analysis has shown that an overwhelming majority of the generated false positives are located in the areas shown in these examples. A refinement to our approach could be to bootstrap these and other repeated false positive examples and retrain the SVM to achieve a lower false positive rate.
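The per-frame-to-per-window conversion is a one-line calculation, assuming one candidate window per pixel at each of the 21 scales (the windowing assumption is ours, inferred from the stated count):

```python
def fpp_to_fpw(fpp, img_w=352, img_h=240, n_scales=21):
    """Convert false positives per frame (FPP) to false positives
    per evaluated candidate window (FPW), assuming one window per
    pixel location at each scale."""
    windows_per_frame = img_w * img_h * n_scales  # 1,774,080 windows
    return fpp / windows_per_frame

rate = fpp_to_fpw(0.606)  # approximately 3.4e-7
```

This is why a rate that looks high per frame is actually very low per classifier decision.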

C. Testing in Different Environments

Experiments were also conducted in another environment to test the trained person classifier's robustness to variations in scene perspective, background, density, and lighting conditions. Data were collected in a new outdoor environment that included multiple pathways through a grassy mall. Six video sequences were collected over several hours from two distinct perspectives that allowed for the capture of the natural movement of people in the environment. In general, these sequences were denser in numbers of people than the initial experiments.

The sequences were evaluated for the best performing trifocal-based classifiers in the initial tests. The SVM classifiers



Fig. 12. Good results for trifocal CSI.

Fig. 13. Common false positive regions for trifocal CSI, shown in yellow.

TABLE I
PERSON DETECTION FOR TRIFOCAL CSI FRAMEWORK AT 90% THRESHOLD

used to evaluate the sequences were identical to those in the original tests. The disparity-based classifier was retrained to account for the change in the disparity-to-bounding-box-size function in the new perspective. This can be done quickly by annotating a handful of new examples and estimating the new best linear fit.

Fig. 14 shows an example of one of the densest frames in the test sequence, where 13 people occupy the scene. The trifocal CSI classifier is able to successfully detect every person with no false positives, while the trifocal ISC classifier detects all but a single pedestrian, again with no false positives. We emphasize that no additional samples were used to evaluate these sequences and that many of the objects in the scene, such as the grass, trees, and foreground fencing, have not been modeled explicitly by the SVM classifiers. Figs. 15 and 16 show additional detection examples for the trifocal ISC and CSI cases, respectively.

Fig. 14. Detection in a crowded scene.

We compiled a comparison of the trifocal detection results for a test sequence in Table II. The results are on par with the original series of test sequences. We do see a noticeable decrease in the per-frame detection rate. This is likely due to the incorporation of a new scene that has no support in the trained classifier. Additionally, these new test sequences have, on average, twice as many people in the scene as the original test sequences. This increases the occurrence of occlusion, which can lead to a missed detection of a person in an individual frame.

D. Temporal Filtered Detection and Tracking

We believe that the per-frame detection rates we achieve are really a lower bound and that increased performance can come from the temporal analysis of the per-frame detections. In our analysis, we count a missed detection for any frame where a person was not properly encapsulated by a bounding box. However, a single missed detection for a person in a given frame is usually corrected in the next few frames. Such a missed



Fig. 15. Additional results for trifocal ISC.

Fig. 16. Additional results for trifocal CSI.

TABLE II
PERSON DETECTION COMPARISON FOR TRIFOCAL CSI AND ISC AT 90% THRESHOLD

detection can be thought of as a missing data sample in a larger tracking framework.
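Treating an intermittent miss as missing data suggests a simple temporal repair: linearly interpolate a short gap between two successful detections of the same person. The sketch below is our illustration of that idea (the track representation and gap limit are assumptions, not the paper's method):

```python
def fill_gaps(track, max_gap=5):
    """Fill short runs of missed detections in a per-person track.

    `track` is a per-frame list of (x, y) detection centers, with None
    marking frames where the detector missed the person. Gaps of at
    most `max_gap` frames bounded by detections on both sides are
    filled by linear interpolation; other gaps are left as None.
    """
    filled = list(track)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1  # find the end of this run of misses
            if 0 < i and j < len(filled) and j - i <= max_gap:
                x0, y0 = filled[i - 1]
                x1, y1 = filled[j]
                span = j - i + 1
                for k in range(i, j):
                    t = (k - i + 1) / span
                    filled[k] = (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
            i = j
        else:
            i += 1
    return filled
```

This kind of filtering converts the per-frame miss rate into a trajectory-level rate, which is the quantity a tracker actually cares about.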

Fig. 17 shows a time-lapsed image of a typical 60-frame sequence in our experiments, where the start and end frames are overlayed on each other. For each person in the scene, we plot the correct per-frame detections as solid dots (blue, cyan, red, magenta, and green, respectively) and plot the missed detections as yellow circles. This plot demonstrates how the intermittent missed detections would not detract from an overall tracking framework. The solid dots for each person clearly indicate the path taken by each person, and the missed detections are relatively few. This means that the missed detection rate is mostly due to intermittent missing data points for tracking rather than a complete inability to detect a person in the scene. Temporal analysis is a crucial aspect of algorithmic approaches to surveillance, as the movement and interaction of objects in the scene can give fundamental insight into the situational analysis of the scene [17]. We feel that our trifocal classification approach



Fig. 17. Time-lapsed display of a typical experimental sequence with per-frame detections overlayed. Correct per-frame detections are shown as colored dots and missed detections are indicated as yellow circles.

gives a natural and robust input to common person tracking techniques such as Kalman [18] and particle filtering [19].
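As a minimal sketch of how such detections would feed a Kalman filter, the following implements a 1-D constant-velocity filter in which None marks a missed detection (the filter simply predicts through the gap). This is an illustrative toy, not the tracker used in any cited work; a real tracker would filter 2-D positions or box parameters.

```python
def kalman_track(measurements, q=0.01, r=1.0):
    """Minimal 1-D constant-velocity Kalman filter.

    `measurements` is a per-frame list of positions; None marks a
    missed detection, handled by running the predict step only.
    Returns the filtered position estimates.
    """
    x, v = measurements[0], 0.0               # state: position, velocity
    p00, p01, p10, p11 = 1.0, 0.0, 0.0, 1.0  # state covariance
    estimates = []
    for z in measurements:
        # Predict: x <- x + v (unit time step), P <- F P F^T + Q.
        x = x + v
        np00 = p00 + p01 + p10 + p11 + q
        np01 = p01 + p11
        np10 = p10 + p11
        np11 = p11 + q
        p00, p01, p10, p11 = np00, np01, np10, np11
        if z is not None:
            # Update with a position-only measurement (H = [1, 0]).
            s = p00 + r
            k0, k1 = p00 / s, p10 / s
            innov = z - x
            x += k0 * innov
            v += k1 * innov
            u00 = (1.0 - k0) * p00
            u01 = (1.0 - k0) * p01
            u10 = p10 - k1 * p00
            u11 = p11 - k1 * p01
            p00, p01, p10, p11 = u00, u01, u10, u11
        estimates.append(x)
    return estimates
```

With a matched constant-velocity target the estimate converges onto the trajectory, and isolated missed detections are bridged by prediction, mirroring the missing-data view of Fig. 17.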

VI. CONCLUSION

We have presented a methodology for analyzing multimodal and multiperspective systems for person surveillance. By incorporating an experimental testbed consisting of two color and two infrared cameras, we are able to expand multispectral color and infrared analysis beyond the specialized long-range surveillance experiments of previous approaches to more general scene configurations common to unimodal approaches.

We presented an algorithmic framework for detecting people in a scene that probabilistically combines an SVM trained on HOG features extracted from color and infrared images with a detector based on the relationship between person size and depth in the scene to create a disparity-based detector. This framework was used to train person detectors for the various combinations of color and infrared multiperspective imagery, including color stereo, infrared stereo, tetravision, and trifocal tensor configurations.

The trained detectors were then used in an experimental evaluation of video sequences captured with our designed testbed. The evaluation definitively demonstrates the performance gains achievable when using the trifocal framework to combine color and infrared features in a unified framework. Both of the trifocal setups outperform their unimodal equivalents, as well as the tetravision-based analysis. Our experiments also demonstrate that the trained detectors generalize well to different scenes and can provide robust input to an additional tracking framework.

REFERENCES

[1] L. Zhang, B. Wu, and R. Nevatia, "Pedestrian detection in infrared images based on local shape features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.

[2] Y. Ran, I. Weiss, Q. Zheng, and L. Davis, "Pedestrian detection via periodic motion analysis," Int. J. Comput. Vis., vol. 71, no. 2, pp. 143–160, 2007.

[3] J. Davis and V. Sharma, "Fusion-based background-subtraction using contour saliency," in Proc. IEEE CVPR Workshop on Object Tracking and Classification Beyond the Visible Spectrum, 2005.

[4] J. Davis and V. Sharma, "Background-subtraction using contour-based fusion of thermal and visible imagery," Comput. Vis. Image Understanding, vol. 106, pp. 162–182, May 2007.

[5] C. O. Conaire, E. Cooke, N. O'Connor, N. Murphy, and A. Smeaton, "Background modeling in infrared and visible spectrum video for people tracking," in Proc. IEEE CVPR Workshop on Object Tracking and Classification Beyond the Visible Spectrum, 2005.

[6] A. Leykin, Y. Ran, and R. Hammoud, "Thermal-visible video fusion for moving target tracking and pedestrian classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2007.

[7] S. J. Krotosky and M. M. Trivedi, "Mutual information based registration of multimodal stereo videos for person tracking," Comput. Vis. Image Understanding, vol. 106, no. 2–3, pp. 270–287, May–Jun. 2007.

[8] M. Bertozzi, A. Broggi, M. Felisa, G. Vezzoni, and M. D. Rose, "Low-level pedestrian detection by means of visible and far infra-red tetravision," in Proc. IEEE Conf. Intell. Vehicles, 2006, pp. 231–236.

[9] M. Bertozzi, A. Broggi, C. Caraffi, M. D. Rose, M. Felisa, and G. Vezzoni, "Pedestrian detection by means of far-infrared stereo vision," Comput. Vis. Image Understanding, vol. 106, no. 2, pp. 194–204, 2007.

[10] S. J. Krotosky and M. M. Trivedi, "On color, infrared and multimodal stereo approaches to pedestrian detection," IEEE Trans. Intell. Transport. Syst., vol. 8, pp. 619–629, Dec. 2007.

[11] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2002.

[12] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1, 2005, pp. 886–893.

[13] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi, "Pedestrian detection using infrared images and histograms of oriented gradients," in Proc. IEEE Conf. Intell. Vehicles, 2006, pp. 206–212.

[14] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," 2001 [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm

[15] J.-Y. Bouguet, "Camera Calibration Toolbox for Matlab" [Online]. Available: http://www.vision.caltech.edu/bouguetj/calib_doc/

[16] K. Konolige, "Small vision systems: Hardware and implementation," in Proc. 8th Int. Symp. Robot. Res., 1997.

[17] S. Park and M. M. Trivedi, "Multi-person interaction and activity analysis: A synergistic track- and body-level analysis framework," Mach. Vis. Appl., pp. 151–166, 2007.

[18] R. E. Kalman, "A new approach to linear filtering and prediction problems," Trans. ASME, J. Basic Eng., vol. 82, pp. 35–45, 1960.

[19] A. Doucet, C. Andrieu, and S. Godsill, "On sequential Monte Carlo sampling methods for Bayesian filtering," Stat. Comput., vol. 10, no. 3, pp. 197–208, 2000.

Stephen J. Krotosky received the B.S. degree in computer engineering from the University of Delaware, Newark, in 2001 and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of California, San Diego (UCSD), in 2004 and 2007, respectively, specializing in signal and image processing.

He is currently an Algorithm Development Engineer with the Advanced Multimedia and Signal Processing Division, Science Applications International Corporation (SAIC), San Diego.

Mohan Manubhai Trivedi received the B.E. degree (with honors) from the Birla Institute of Technology and Science, Pilani, India, and the Ph.D. degree from Utah State University, Logan.

He is a Professor of electrical and computer engineering and the Founding Director of the Computer Vision and Robotics Research Laboratory, University of California, San Diego. His team designed and deployed the "Eagle Eyes" system on the U.S.–Mexico border in 2006 as a part of a Homeland Security project. He served on a panel

dealing with the legal and technology issues of video surveillance organized by the Constitution Project in Washington, DC, as well as at the Computers, Freedom and Privacy Conference. He will serve as the General Chair for the IEEE AVSS 2008 (Advanced Video and Signal Based Surveillance) Conference. He regularly serves as a consultant to industry and government agencies in the United States and abroad. He is serving as an Expert Panelist for the Strategic Highway Research Program (Safety) of the Transportation Research Board of the National Academy of Sciences.

