Pedestrian Detection and Tracking in Images and Videos

Azar Fazel, Stanford University

Viet Vo, Stanford University

Abstract

The increase in population density and accessibility to cars over the past decade has led to extensive computer vision research in recognition and detection to promote a safer environment. Much of this research focuses on detecting pedestrians in order to reduce the chance of collision and to improve traffic control. The need for increased surveillance at the workplace and at home also motivates research in this area. We implemented a pedestrian detection and tracking algorithm using histogram of oriented gradients (HOG) features with a linear support vector machine (SVM) and a Random Forest classifier. Our goal was to generate bounding boxes for people within static images, and then to apply this strategy to localize and track people within videos. We benchmarked different HOG parameters to find the best model, and extended our experiments by comparing the effectiveness of SVM versus Random Forests. Our implementation achieved 80% accuracy on static images, and was able to track pedestrians in videos as long as the detected pedestrians' poses did not vary significantly.

1. INTRODUCTION

In the United States alone, 5,000 of the 35,000 annual traffic crash fatalities involve pedestrians [1]. Computer vision research in the area of pedestrian detection is becoming increasingly crucial as more intelligent motor vehicles are introduced onto the streets. However, pedestrian detection and tracking is an inherently hard problem due to high intra-class variability and partial occlusions. Our goal was to benchmark different HOG feature parameters and to compare the success of SVM versus Random Forests for pedestrian detection. We implemented our algorithms in Python and used computer vision packages from OpenCV, machine learning packages from sklearn, and image processing packages from scikit-image. To train our model, we used a combination of 5,400 positive 64 x 128 images from the INRIA Person dataset [2], the PETA dataset [3], and the MIT database [4]. Our negative images came from these same databases and from the Daimler Mono dataset [5], for a total of 2,100 images. From each non-pedestrian image, 10 random windows of 64 x 128 pixels were extracted for training, giving a total of 21,000 negative images. The trained model was then used to test the detection accuracy on images and to track pedestrians in videos.
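The negative sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_negative_windows` and the 320 x 240 image size are hypothetical, and only the window size (64 x 128), the 10 windows per image, and the 2,100 image count come from the text.

```python
import random

WINDOW_W, WINDOW_H = 64, 128  # training window size stated in the paper


def sample_negative_windows(image_w, image_h, n_windows=10, rng=None):
    """Return n_windows random (x, y, w, h) crops that fit inside the image."""
    rng = rng or random.Random()
    if image_w < WINDOW_W or image_h < WINDOW_H:
        raise ValueError("image is smaller than the 64x128 window")
    windows = []
    for _ in range(n_windows):
        x = rng.randint(0, image_w - WINDOW_W)  # randint includes both ends
        y = rng.randint(0, image_h - WINDOW_H)
        windows.append((x, y, WINDOW_W, WINDOW_H))
    return windows


# Hypothetical image sizes standing in for the 2,100 non-pedestrian images.
rng = random.Random(0)
image_sizes = [(320, 240)] * 2100
negatives = [win for (w, h) in image_sizes
             for win in sample_negative_windows(w, h, 10, rng)]
print(len(negatives))  # 21000
```

Sampling 10 windows from each of 2,100 images yields the 21,000 negative examples reported above.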

2. PREVIOUS WORK

Many techniques are used today for pedestrian detection. One technique that is similar to HOG is the Scale Invariant Feature Transform (SIFT). SIFT generates features by applying Difference-of-Gaussians (DoG) to an image's scale-space pyramid to find interesting local keypoints. Each keypoint is assigned an orientation, and the resulting descriptor is invariant to scale and rotation. Due to the high dimensionality of SIFT features, principal component analysis is often used in conjunction with SIFT [10]. Although SIFT can be effective for detecting human features, Dalal and Triggs explained in their paper that locally normalized HOG descriptors were more effective [6]. They showed experimentally that the dense grid of uniformly spaced cells and the overlapping local contrast normalization improved detection performance compared to SIFT. In fact, HOG features have been shown to produce 1 to 2 orders of magnitude fewer false positives than other approaches.

The Deformable Parts Model (DPM) is another technique for object detection that performs well at classifying highly variable object classes. In this technique, a HOG feature pyramid is formed for each image by varying the scale of the image, and a root filter and a set of part filters are defined. The root filter is coarse and captures the general shape of the object, while higher resolution part filters capture small parts of the object. Objects are then detected by computing an overall score for each root location based on the best possible placement of the parts [7].
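For reference, the scoring rule of [7] is usually written as below. The notation is the common presentation of the model and is not taken from this report: $F_0$ is the root filter, $F_1, \dots, F_n$ are the part filters, $\phi(H, p_i)$ is the HOG feature window at position $p_i$ in the pyramid $H$, $d_i$ holds the deformation cost coefficients of part $i$, $\phi_d(dx_i, dy_i) = (dx_i, dy_i, dx_i^2, dy_i^2)$ describes the displacement of part $i$ from its anchor, and $b$ is a bias term.

```latex
\mathrm{score}(p_0) = \max_{p_1, \dots, p_n}
  \left[ \sum_{i=0}^{n} F_i \cdot \phi(H, p_i)
       - \sum_{i=1}^{n} d_i \cdot \phi_d(dx_i, dy_i) + b \right]
```

The maximization over $p_1, \dots, p_n$ is the "best possible placement of the parts" mentioned above.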

Another technique for pedestrian detection is the Convolutional Neural Network (CNN). CNNs have shown outstanding power in addressing the pedestrian detection problem, especially in the context of autonomous driving. A CNN learns the convolution parameters that produce features from which the desired output can be predicted easily. These features can then be extracted from the last fully connected layers and used to train an SVM model for pedestrian detection [11][12].

3. TECHNICAL APPROACH

Using our dataset of positive and negative images, we extracted features using the histogram of oriented gradients technique described by Dalal and Triggs. This technique divides the image into a dense grid of equally sized, overlapping blocks. Each block is further divided into cells, and a 1-D histogram of gradient orientations is computed over the pixels of each cell. For this project, we experimented with block sizes of 2x2 and 4x4 cells and cell sizes of 8x8 and 16x16 pixels in order to find the best combination. We used 9 orientation bins for our histograms across all experiments. The histograms of each block are combined and then normalized to provide better invariance to illumination and shadowing [6]. Figure 1 shows an example of extracted HOG features for a pedestrian.
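To make the effect of these parameters concrete, the sketch below computes the length of the HOG descriptor for a 64 x 128 window under each tested configuration. It assumes that blocks slide by one cell at a time, which is the convention of scikit-image's `hog` function; the report does not state the block stride, and the helper name `hog_feature_length` is hypothetical.

```python
def hog_feature_length(window=(64, 128), cell=8, block=2, bins=9):
    """Length of a HOG descriptor, assuming blocks slide by one cell."""
    cells_x = window[0] // cell
    cells_y = window[1] // cell
    blocks_x = cells_x - block + 1
    blocks_y = cells_y - block + 1
    if blocks_x < 1 or blocks_y < 1:
        return 0  # the block does not fit inside the window
    # Each block contributes block*block cell histograms of `bins` values.
    return blocks_x * blocks_y * block * block * bins


for block in (2, 4):
    for cell in (8, 16):
        print(block, cell, hog_feature_length(cell=cell, block=block))
# 2 8 3780
# 2 16 756
# 4 8 9360
# 4 16 720
```

Under this assumption the four configurations produce descriptors of very different dimensionality, which is one plausible reason the classifiers respond so differently to them.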

Figure 1: Example of HOG features. The right picture is the original image and the left one shows the extracted HOG features.

To train the model, the feature vectors of the images were fed into a linear SVM classifier. This model was then used to classify pedestrians versus non-pedestrians. We implemented a sliding window approach to exhaustively search static images for windows with scores greater than 0.2. The score for each window was calculated using the weights and bias of our SVM model. Since our sliding window was kept at a constant size of 64x128 pixels, we used an image pyramid for detection: each image was repeatedly scaled down by 15% until it became smaller than the detection window (64 pixels in width, 128 pixels in height). At each scale, our detector window searched the entire scaled image and calculated scores using our SVM model. Once the search was finished, the bounding boxes were scaled back and displayed on the original image, and non-maximal suppression was applied to eliminate redundant boxes.
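The pyramid and sliding-window search described above can be sketched as follows. Several details are assumptions rather than facts from the report: the image is taken to shrink to 85% of its previous size at each level, the search is taken to stop as soon as either dimension is smaller than the window, and the 8-pixel step is a placeholder. All function names are hypothetical.

```python
def pyramid_sizes(width, height, shrink=0.85, min_size=(64, 128)):
    """Yield (scale, width, height) for each pyramid level, largest first."""
    scale = 1.0
    while True:
        w, h = int(width * scale), int(height * scale)
        if w < min_size[0] or h < min_size[1]:
            break  # the detection window no longer fits
        yield scale, w, h
        scale *= shrink


def sliding_windows(width, height, window=(64, 128), step=8):
    """Yield the top-left corner of every window that fits in the image."""
    for y in range(0, height - window[1] + 1, step):
        for x in range(0, width - window[0] + 1, step):
            yield x, y


def svm_score(features, weights, bias):
    """Linear SVM decision value w . x + b for one window."""
    return sum(w * f for w, f in zip(weights, features)) + bias


def to_original(x, y, scale, window=(64, 128)):
    """Map a window found at a pyramid level back to original-image pixels."""
    return (int(x / scale), int(y / scale),
            int(window[0] / scale), int(window[1] / scale))


levels = list(pyramid_sizes(128, 256))
print(len(levels))  # 5
```

A window would be reported as a detection when `svm_score` exceeds the 0.2 threshold, after which `to_original` gives its box in the coordinates of the original image.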

In order to reduce the false positive rate, we mined for hard negative examples using our negative training data. We extracted all false positive windows found within the negative images and added these examples to our training data before retraining the classifier.
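The mining loop can be sketched in a few lines. This is an illustration under stated assumptions: `train_fn` is a hypothetical callable that returns a scoring function, and `toy_train` is a stand-in trainer on 1-D features used only to make the example runnable. Only the 0.2 score threshold comes from the text.

```python
def mine_hard_negatives(score_fn, negative_windows, threshold=0.2):
    """Return the negative windows the model wrongly scores as pedestrians."""
    return [w for w in negative_windows if score_fn(w) > threshold]


def retrain_with_hard_negatives(train_fn, positives, negatives, negative_pool,
                                threshold=0.2):
    """Train once, collect false positives from negative_pool, then retrain."""
    score_fn = train_fn(positives, negatives)
    hard = mine_hard_negatives(score_fn, negative_pool, threshold)
    return train_fn(positives, negatives + hard), hard


def toy_train(positives, negatives):
    """Stand-in trainer: score is the distance past the class midpoint."""
    midpoint = (sum(positives) / len(positives)
                + sum(negatives) / len(negatives)) / 2
    return lambda x: x - midpoint


positives = [4.0, 5.0, 6.0]
negatives = [0.0, 1.0, 2.0]
pool = [0.5, 1.5, 3.3, 3.5]  # windows cut from pedestrian-free images
model, hard = retrain_with_hard_negatives(toy_train, positives, negatives, pool)
print(hard)                             # [3.3, 3.5]
print(mine_hard_negatives(model, pool))  # []
```

In the toy run, the two windows that fooled the first model are added to the negative set, and the retrained model no longer fires on them.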

Due to the exhaustive search performed during HOG feature extraction, the time complexity of object detection is very high. This poses a problem for pedestrian tracking in videos because detection would be too slow. To remedy this problem, we extracted moving objects from each frame using background subtraction [8]. With this method, we detected motion by segmenting moving objects from the background and passed these smaller image regions to our model instead of the whole frame. Let $I_n$ denote the intensity image of the nth frame and $I_{n-1}$ that of the previous frame. A pixelwise subtraction gives

$$M_n(i, j) = \begin{cases} I_n(i, j), & \Delta(i, j) \geq T_{\text{threshold}} \\ 0, & \Delta(i, j) < T_{\text{threshold}} \end{cases}$$

where $i$ and $j$ are pixel positions, $\Delta(i, j)$ is the pixelwise difference between $I_n$ and $I_{n-1}$, $T_{\text{threshold}}$ is the motion threshold, and $M_n$ is the motion image. Restricting detection to the motion image dramatically reduces the complexity of our computation [9]. Using these motion images, we were able to run our model on video frames much faster than without motion detection.

Figure 2 provides a summary of the steps in our detection algorithm.
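The rule for the motion image can be sketched directly on nested lists of intensities. The sketch assumes that the frame difference is the absolute difference between consecutive frames, which the text does not state explicitly; the function name `motion_image` and the sample values are hypothetical.

```python
def motion_image(current, previous, threshold):
    """Keep I_n(i, j) where the frame difference reaches the threshold, else 0."""
    return [
        [cur if abs(cur - prev) >= threshold else 0
         for cur, prev in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(current, previous)
    ]


previous = [[10, 10, 10], [10, 10, 10]]
current = [[10, 80, 12], [90, 10, 10]]
print(motion_image(current, previous, threshold=25))
# [[0, 80, 0], [90, 0, 0]]
```

Only the two pixels that changed by at least the threshold survive, so the detector needs to examine a much smaller region of the frame.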

4. EXPERIMENTS AND RESULTS

To obtain the model with the highest accuracy, we tried two different classifiers: SVM and Random Forest. To evaluate our models, we tested them on a validation set of 1,000 pedestrian images. For the SVM classifier, we investigated different values of the regularization parameter C. The regularization parameter tells the SVM optimization how much we want to avoid misclassifying each training example. For large values of C, the optimization will choose a smaller-margin hyperplane if that hyperplane does a better job of classifying all the training points correctly. Conversely, a very small value of C will cause the optimizer to look for a larger-margin separating hyperplane, even if that hyperplane misclassifies more points. We obtained the highest accuracy for the SVM model with a regularization parameter of 0.001.
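The role of C is easiest to see in the standard soft-margin objective for a linear SVM, shown here for labels $y_i \in \{-1, +1\}$ and feature vectors $x_i$. This is the textbook formulation and is given as an illustration; the exact loss used by a particular solver may differ (for example, some implementations square the hinge term).

```latex
\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2
  + C \sum_{i=1}^{N} \max\bigl(0,\; 1 - y_i (w^\top x_i + b)\bigr)
```

The first term favors a wide margin and the second penalizes training points that fall on the wrong side of it, so a small value such as C = 0.001 weights the margin heavily relative to individual misclassifications.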

For Random Forest, we examined different numbers of trees when training the model. Random Forest uses bagging (picking a sample of observations rather than all of them) and the random subspace method (picking a sample of features rather than all of them) to grow each tree. If the number of observations is large but the number of trees is too small, then some observations will be predicted only once or even not at all. If the number of predictors is large but the number of trees is too small, then some features can be missed in all of the subspaces used. Both cases decrease the predictive power of the random forest. The latter is a rather extreme case, however, since the subspace is selected anew at each node. In general, the more trees we use, the better the results. However, the improvement diminishes as the number of trees increases: at a certain point, the benefit in prediction performance from learning more trees is lower than the cost in computation time of learning them. For our dataset, Random Forest provided the best accuracy with 1,000 trees.
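One way to read the remark about observations being predicted "only once or even not at all" is in terms of out-of-bag prediction, where a tree only predicts observations missing from its bootstrap sample. That interpretation is an assumption, as are the function names below. The sample count of 26,400 is the sum of the 5,400 positive and 21,000 negative training images.

```python
def in_bag_probability(n_samples):
    """Chance that an observation appears in one bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n_samples) ** n_samples


def never_out_of_bag(n_samples, n_trees):
    """Chance that an observation is in-bag for every tree, so that it
    never receives an out-of-bag prediction."""
    return in_bag_probability(n_samples) ** n_trees


for trees in (1, 5, 10, 100):
    print(trees, never_out_of_bag(26400, trees))
```

With a single tree, roughly 63% of observations get no out-of-bag prediction; the fraction falls off geometrically as trees are added, which matches the diminishing returns described above.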

Furthermore, in order to tune the hyperparameters of the HOG features, we extracted them using different block sizes and cell sizes. Table 1 shows the results of these experiments. As seen in this table, the block size and the cell size have a significant effect on the accuracy of our models. In other words, the effectiveness of the models strongly depends on the HOG feature parameters. The table also shows that Random Forest outperforms SVM in all cases except when the block size is 2x2 and the cell size is 8x8, where the accuracy of the SVM model is higher than that of Random Forest. From these results, we conclude that there is no universally optimal configuration for HOG features; the best configuration depends on the dataset being used.

In order to reduce the false positive rate of our models, we exhaustively searched all 2,100 negative images, extracted 5,800 windows of 64x128 pixels as false positive objects, and then retrained our model with the augmented set. Using 1,000 new negative images for validation, the original model had a false positive rate of 0.005% while the new model with hard negative mining had a 0% false positive rate. Most of the false positives came from objects that are upright and thin, such as poles and trees. Our hard negative mined model eliminated many of these false positives. An example of this improvement is shown in Figure 3.

As mentioned in Section 3, since we found multiple bounding boxes for each object, we used non-maximal suppression to remove the redundant bounding boxes. Figure 4 shows an example of using non-maximal suppression on two images.
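A greedy form of non-maximal suppression can be sketched as follows. The report does not specify which variant or overlap threshold was used, so the intersection-over-union criterion and the 0.5 threshold are assumptions; boxes are given as (x1, y1, x2, y2) corners.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, overlap_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= overlap_threshold for k in kept):
            kept.append(i)
    return [boxes[i] for i in kept]


boxes = [(10, 10, 74, 138), (14, 12, 78, 140), (200, 20, 264, 148)]
scores = [0.9, 0.6, 0.7]
print(non_max_suppression(boxes, scores))
# [(10, 10, 74, 138), (200, 20, 264, 148)]
```

The second box overlaps the higher-scoring first box almost entirely and is discarded, while the third box covers a different object and is kept.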

For background subtraction, we calculated a reference image using a Gaussian Mixture-based background/foreground segmentation algorithm. We then subtracted each new frame from this image to compute a foreground mask. The result is a binary segmentation of the image that highlights regions of non-stationary objects. This allowed us to segment the moving regions in image sequences in real time. Figure 5 shows an example of background subtraction for one frame of a video.
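The idea behind a statistical background model can be illustrated with a deliberately simplified version that keeps a single running Gaussian per pixel instead of a mixture. This is not the algorithm used in the report: OpenCV provides mixture-based subtractors such as `cv2.createBackgroundSubtractorMOG2`, and the report does not say which one was used. The class name and all parameter values below are hypothetical.

```python
class RunningGaussianBackground:
    """Per-pixel running mean and variance (one Gaussian, not a mixture)."""

    def __init__(self, first_frame, learning_rate=0.05, k=2.5,
                 initial_var=25.0, min_var=4.0):
        self.mean = [[float(v) for v in row] for row in first_frame]
        self.var = [[initial_var for _ in row] for row in first_frame]
        self.lr = learning_rate
        self.k = k
        self.min_var = min_var

    def apply(self, frame):
        """Return a binary foreground mask (1 = moving) and update the model."""
        mask = []
        for i, row in enumerate(frame):
            mask_row = []
            for j, value in enumerate(row):
                diff = value - self.mean[i][j]
                # Foreground if the pixel is more than k standard deviations away.
                foreground = diff * diff > (self.k ** 2) * self.var[i][j]
                mask_row.append(1 if foreground else 0)
                if not foreground:
                    # Only background pixels update the model.
                    self.mean[i][j] += self.lr * diff
                    updated = self.var[i][j] + self.lr * (diff * diff - self.var[i][j])
                    self.var[i][j] = max(self.min_var, updated)
            mask.append(mask_row)
        return mask


model = RunningGaussianBackground([[100, 100], [100, 100]])
mask = model.apply([[101, 100], [100, 180]])
print(mask)  # [[0, 0], [0, 1]]
```

A mixture model extends this by keeping several Gaussians per pixel, which lets it absorb backgrounds that alternate between a few appearances, such as swaying foliage.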

Figure 2: Flow Chart of Major Steps for Pedestrian Detection and Tracking.

Block size   Cell size   SVM accuracy   Random Forest accuracy
2x2          8x8         80.8%          67.7%
2x2          16x16       69.7%          81.1%
4x4          8x8         66.7%          69.1%
4x4          16x16       0%             80.4%

Table 1: The accuracy of SVM and Random Forest models using different HOG parameters

Figure 3: Reduction in false positive rates using hard negative mining. Poles and other erect patterns were eliminated in the new SVM model trained with the augmented negative dataset.

Figure 4: Redundant bounding boxes were eliminated using non-maximal suppression on these images.

Figure 5: Example of applying background subtraction on one frame of a video.

Figure 6: Pedestrian detection and tracking in one frame of a sample video.

To track pedestrians in videos, we applied background subtraction to obtain the foreground mask, found the contours in each frame, and computed a bounding box for each contour. Since all the training images are 64 x 128 pixels, we enlarged any contour region whose height or width was smaller than the training image size by adding padding to it. We then applied our classifier to each region to decide whether it contained a pedestrian, and used non-maximal suppression to remove multiple bounding boxes for the same object. Figure 6 shows the results for one frame of a video. There are six pedestrians in the frame and the classifier detected three of them. The others are either occluded or in a pose that the classifier cannot detect.

To check whether moving objects other than pedestrians confuse the classifier, we tested it on videos that contain different types of moving objects. Figure 7 is an example of a frame that contains pedestrians, motorcycles, and a truck. As shown in the figure, the classifier detected only the pedestrians and excluded the truck and motorcycles.
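The padding of small contour regions up to the training window size can be sketched as below. The report does not say how the padding was distributed, so centering it on the original box and clamping to the frame are assumptions; the function name `pad_to_window` and the 640 x 480 frame size are hypothetical.

```python
def pad_to_window(box, frame_size, window=(64, 128)):
    """Grow an (x, y, w, h) box to at least the window size, inside the frame."""
    x, y, w, h = box
    frame_w, frame_h = frame_size
    new_w = min(max(w, window[0]), frame_w)
    new_h = min(max(h, window[1]), frame_h)
    # Centre the padding on the original box, then clamp to the frame.
    new_x = x - (new_w - w) // 2
    new_y = y - (new_h - h) // 2
    new_x = max(0, min(new_x, frame_w - new_w))
    new_y = max(0, min(new_y, frame_h - new_h))
    return new_x, new_y, new_w, new_h


print(pad_to_window((100, 50, 30, 90), (640, 480)))  # (83, 31, 64, 128)
print(pad_to_window((5, 400, 40, 60), (640, 480)))   # (0, 352, 64, 128)
```

Boxes that are already at least 64 x 128 pass through unchanged, so only undersized contour regions are affected.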

Figure 7: Pedestrian detection and tracking in a frame with different types of moving objects.

Furthermore, to evaluate our classifier, we tried 5 different videos of different durations. The total number of pedestrians in these videos was 40 and our classifier detected 24 of them, which corresponds to 60% accuracy. As a side note, finding videos that contain both moving pedestrians and moving non-pedestrian objects was difficult: the cameras in most videos were not fixed, so detection with our background subtraction approach was not possible. We therefore evaluated our classifier on only a small number of videos. To watch a complete demo of the performance of our classifier, please refer to the YouTube link provided in Section 6.

5. CONCLUSION AND FUTURE WORK

We have demonstrated that HOG feature descriptors combined with an SVM or Random Forest classifier and hard negative mining provide an effective strategy for pedestrian detection and tracking. The main drawbacks of this approach are that our model is unable to track a large variety of human poses, and that it can only track pedestrians after some delay is added to the video due to the high complexity of HOG extraction. In many videos, occlusion is often present while pedestrians are moving through the scene, causing difficulties in detection. In the future, other techniques for pedestrian tracking, such as optical flow and Kalman filtering, can be added to our system. Tracking humans is an inherently difficult problem in the computer vision community, but solving it can greatly reduce the number of annual motor vehicle casualties and reduce crime rates through improved surveillance systems at home and at work.

6. GITHUB AND YOUTUBE LINKS

Our GitHub code: https://github.com/afazel/CS231A_Project

Our YouTube video on pedestrian tracking: https://www.youtube.com/watch?v=0lEJIh6dWAE

References

[1] 2014 Motor Vehicle Crashes: Overview. U.S. Department of Transportation, March 2016.

[2] http://pascal.inrialpes.fr/data/human/

[3] http://mmlab.ie.cuhk.edu.hk/projects/PETA.html

[4] http://cbcl.mit.edu/software-datasets/PedestrianData.html

[5] S. Munder and D. M. Gavrila. An Experimental Study on Pedestrian Classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1863-1868, November 2006.

[6] Navneet Dalal and Bill Triggs. Histograms of Oriented Gradients for Human Detection. International Conference on Computer Vision & Pattern Recognition, June 2005.

[7] Pedro Felzenszwalb, Ross Girshick, David McAllester, and Deva Ramanan. Object Detection with Discriminatively Trained Part Based Models.

[8] Nan Lu, Jihong Wang, Q.H. Wu, and Li Yang. An Improved Motion Detection Method for Real-Time Surveillance. Annalen der Physik, 322(10):891921, 1905.

[9] Nan Lu, Jihong Wang, Q.H. Wu, and Li Yang. Histograms of Oriented Gradients for Human Detection. IAENG International Journal of Computer Science, 35:1.

[10] Stefan Zickler and Alexei Efros. Detection of Multiple Deformable Objects using PCA-SIFT. Carnegie Mellon University, 2007.

[11] Sergi Canyameres Masip and Antonio Manuel López Peña. On the use of Convolutional Neural Networks for Pedestrian Detection. 2015.

[12] M. Szarvas, A. Yoshizawa, M. Yamamoto, and J. Ogata. Pedestrian detection with convolutional neural networks. Intelligent Vehicles Symposium, 2005.
