
Video Monitoring System: Counting People by Tracking

Dat Nguyen, Loc Huynh, Thang Ba Dinh, Tien Dinh

Faculty of Information Technology University of Science, Vietnam National University-Ho Chi Minh

{ntdat, hvloc}@apcs.vn, {dbthang, dbtien}@fit.hcmus.edu.vn

Abstract—We present a video monitoring system to count the number of people in an open area, such as an airport, a bus station, or a shopping mall, using a single camera. Our system automatically infers the number of people based on a novel multi-people tracker. The tracking framework is formulated as a data association problem, in which people are detected in every frame and all of the detection responses are then associated to form the right track for each person. The detection step is carried out by an efficient state-of-the-art method, while the data association is done using a hierarchical framework in which the detection responses form short tracklets, the short tracklets form longer ones, and so on. After that, the number of people is counted based on the number of tracklets, from which the numbers of entered and exited people are also obtained. Beyond merely counting, we also aim to assign the right person to the right track by proposing a novel appearance model that helps re-identify a person, especially after heavy occlusion. This proposed solution is equivalent to addressing the problem of ID switching in multi-target tracking. The system is tested in both indoor and outdoor environments, on public datasets together with our new dataset. We also compare different components of our framework against other state-of-the-art trackers.

Keywords-pedestrian tracking; people counting; data association; video monitoring

I. INTRODUCTION

Video monitoring is an important application not only for security but also for marketing purposes. One of its crucial tasks is to count the number of people. The problem can be narrowed down to counting the people passing a door or a gate. However, that strategy is not applicable when the task takes place in an open area, which makes it much more challenging. It is also costly and often infeasible to deploy multiple sensors in such places, especially when heavy occlusions exist.

One of the most efficient ways to solve the people counting problem is to use a single camera mounted above the entry and exit points [2,13]. Usually, the number of people is counted by applying head detection or background subtraction, followed by segmentation and simple blob tracking. However, this setting is not applicable in an open area. Another approach is to use other types of sensors to detect people. For example, Son et al. [14] propose to use two photo-beam sensors to count people in mountain areas. More recently, Kutschera et al. [15] propose a sensor-mat that counts the people stepping on it. Even though these methods can provide real-time performance, they must be deployed in a very small area or at exit and entry points. With a different approach, several methods take advantage of multi-target tracking algorithms to help count the number of people [1,16]. However, these methods employ simple background subtraction to detect objects, which is vulnerable in practice, where people may appear in groups.

In this paper, we present a novel pedestrian tracker that can be applied directly to count people entering and leaving a site with high accuracy. This tracking system uses a hierarchical association of detection responses. It takes full advantage of not only motion information but also appearance information by adopting the online learning framework of discriminative appearance models. Within a sliding time window, detection responses are gathered to form short track fragments (tracklets) based on spatial-temporal information. Then, these tracklets provide affinity and discriminative information, which can be used to collect samples for learning appearance models. To better learn target-specific appearance models, we classify these tracklets into gallery and query tracklets based on how reliable they are, as in [5]. In the learning process, AdaBoost is applied to maximize the discriminative capability. Finally, we integrate this model into the data association framework to effectively track and count people in the scene. Because we rely on the tracking system, we need not define counting lines or virtual gates as in [1,2]. This makes our system independent of location and camera view. Thus, it can be applied in many

Figure 1. Overview of our system

978-1-4673-0309-5/12/$31.00 ©2012 IEEE

real-world scenarios. It is also important to note that we do not address the problem of people counting in heavily crowded environments as in [3,17], where the number of people is often estimated by statistical models. An overview of our approach is shown in Figure 1.

II. PEDESTRIAN TRACKING

We use the detector of [4] and employ the state-of-the-art Person Identity Recognition based Multi-Person Tracking (PIRMPT) framework [5], since it has been proven to effectively overcome detection errors and shows robust pedestrian tracking performance. Unlike PIRMPT [5], we integrate a pool of part-based intensity templates into the framework to better describe image texture. The framework has two basic phases: low-level association and high-level association.

A. Low Level Association

The low-level association generates short but reliable tracklets by linking pairs of similar responses as in [5]. Basic spatial-temporal information and low-level features such as color histograms are used. The affinity score S between two responses in neighboring frames is defined as the product of position, size, and color-histogram similarities. We assume that one response cannot belong to two different people in a particular frame. We apply a dual-threshold strategy for safe and conservative associations as in [5].

Observing that some tracklets generated in the low-level association phase are very short and more likely to be false alarms, we classify low-level tracklets into gallery and query tracklets as in [5] to obtain target-specific models.
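The dual-threshold linking of neighboring-frame responses can be sketched as below. The Gaussian widths, the histogram-intersection color score, and the thresholds `theta1`/`theta2` are illustrative assumptions, not values from the paper; a response joins a link only if its best affinity is both high and clearly better than every competing pair.

```python
import numpy as np

def affinity(r1, r2):
    """Affinity between two detection responses in neighboring frames:
    the product of position, size, and color-histogram similarities.
    Responses are dicts with a numpy 'center', a 'height', and a
    normalized color 'hist'; the Gaussian width (15 px) is illustrative."""
    pos = np.exp(-np.sum((r1["center"] - r2["center"]) ** 2) / (2 * 15.0 ** 2))
    size = np.exp(-abs(r1["height"] - r2["height"]) / r1["height"])
    color = np.minimum(r1["hist"], r2["hist"]).sum()  # histogram intersection
    return pos * size * color

def link_frames(frame_a, frame_b, theta1=0.6, theta2=0.2):
    """Dual-threshold strategy: link response i to j only if their affinity
    exceeds theta1 AND beats every competing pair in row i / column j by
    theta2, so each response joins at most one tracklet (conservative)."""
    S = np.array([[affinity(a, b) for b in frame_b] for a in frame_a])
    links = []
    for i in range(S.shape[0]):
        j = int(np.argmax(S[i]))
        rival = max(np.delete(S[i], j).max(initial=0.0),
                    np.delete(S[:, j], i).max(initial=0.0))
        if S[i, j] > theta1 and S[i, j] - rival > theta2:
            links.append((i, j))
    return links
```

With one response in the first frame and a near/far pair in the second, only the nearby, similarly-sized response is linked.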

B. High Level Association

This step associates the obtained tracklets to form complete pedestrian trajectories. This can be formulated as a MAP problem, which can be solved by the Munkres assignment algorithm as in [6]. Define T = {T_1, T_2, ..., T_n} as the set of all tracklets obtained in the previous steps. The affinity score between tracklets is given by the following formula:

S(T_j | T_i) = A_a(T_j | T_i) · A_m(T_j | T_i) · A_t(T_j | T_i)    (1)

where A_a, A_m, and A_t are the appearance, motion, and temporal affinities, respectively.

Using a Kalman filter, we obtain the refined position p and velocity v of the head part and tail part of each tracklet. The motion model is defined as:

A_m(T_j | T_i) = G(p_j^head − (p_i^tail + v_i^tail Δt); Σ) · G(p_i^tail − (p_j^head − v_j^head Δt); Σ)    (2)

where G(·; Σ) is a zero-mean Gaussian and Δt is the frame gap between the tail of T_i and the head of T_j. The temporal model is defined as:

A_t(T_j | T_i) = 1 if Δt > 0, and 0 otherwise    (3)

This is a direct implication of the observation that it is impossible to link the tail of T_i to the head of T_j if T_j appears earlier than T_i.
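The MAP tracklet association can be illustrated with a small stand-in for the Munkres algorithm. The sketch below brute-forces the same optimization over permutations (viable only for tiny inputs, unlike the Hungarian solver the paper uses); the affinity matrix and the `min_affinity` cutoff are hypothetical.

```python
import itertools
import math

def associate_tracklets(affinity, min_affinity=1e-6):
    """MAP tracklet linking by exhaustive search: affinity[i][j] is the
    score for appending the head of tracklet j to the tail of tracklet i
    (already zero wherever the temporal model forbids the link).
    Maximizing the product of affinities equals maximizing the sum of
    logs; pairs at or below min_affinity are treated as 'no link'."""
    n = len(affinity)
    best_score, best_links = -math.inf, []
    for perm in itertools.permutations(range(n)):
        score = sum(math.log(max(affinity[i][j], 1e-12))
                    for i, j in enumerate(perm))
        if score > best_score:
            best_score = score
            best_links = [(i, j) for i, j in enumerate(perm)
                          if affinity[i][j] > min_affinity]
    return best_links
```

For a 2x2 affinity matrix with strong diagonal scores, the diagonal assignment wins.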

We will discuss in detail the appearance model in the next section.

III. APPEARANCE MODEL

A. Descriptor Selection

For each detection response, appearance descriptors are selected from a large set of features computed over different image channels and local regions. We do not use all of these descriptors, since not all of them contribute equally to the appearance model. Therefore, we perform offline training on the ground truth of the CAVIAR dataset, applying the standard AdaBoost algorithm as in [5], to find the descriptors that are effective for comparing two images. After the offline training, we obtain a smaller set of descriptors that is used for online training as the feature pool proposed in [7].

B. Collecting Samples

To learn an appearance affinity model, a training sample is defined as a pair of detection responses. Based on the spatial and temporal information, we collect the samples as follows. Positive samples are pairs of responses in the same tracklet, since the tracklets are reliable under our assumption. Also, based on the observation that one target cannot appear at different locations at the same time, negative samples are pairs of responses in different tracklets that overlap in time. Therefore the training set can be defined as B = B⁺ ∪ B⁻, where:

B⁺ = {(r_i, r_j) : y = +1 | r_i and r_j belong to the same tracklet}
B⁻ = {(r_i, r_j) : y = −1 | r_i ∈ T_k, r_j ∈ T_l, k ≠ l}

For each gallery tracklet, the training data is used to learn a target-specific appearance-based affinity model. For a query tracklet, the training data contain the union of all B in our design, since a query tracklet is less effective for learning a meaningful target-specific model.
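The sample-collection rule above can be sketched as follows; the tracklet representation (a list of `(frame, descriptor)` responses) is an assumption for illustration, and the temporal-overlap test implements the "one target cannot be in two places at once" observation.

```python
import itertools

def collect_samples(tracklets):
    """Build appearance-model training pairs: responses on the same
    (reliable) tracklet are positives; responses on different tracklets
    that overlap in time are negatives.  Each tracklet is a list of
    (frame, descriptor) responses; descriptors are treated as opaque."""
    positives, negatives = [], []
    for trk in tracklets:
        positives += [(a[1], b[1], +1)
                      for a, b in itertools.combinations(trk, 2)]
    for t1, t2 in itertools.combinations(tracklets, 2):
        frames1 = {f for f, _ in t1}
        if any(f in frames1 for f, _ in t2):  # temporal overlap
            negatives += [(a[1], b[1], -1) for a in t1 for b in t2]
    return positives, negatives
```

Two tracklets sharing frame 1 contribute negatives; a tracklet far away in time contributes none.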

C. Extracting Features

1) Object Appearance Descriptors

We use color histograms, HOG features [9], and intensity-based templates to describe the object. For color appearance, we use the RGB color space and extract a histogram from each color channel to get vectors f_R, f_G, f_B, which we concatenate to form a single vector f = [f_R, f_G, f_B]. In our design, 8 bins are used for each channel, so the feature vector has 24 dimensions. To capture the shape of the object, HOG features [9] are used: we compute 8 orientation bins for each of the 2x2 cells over a region R, then concatenate them to obtain a 32-D HOG feature. Intensity-based templates have also obtained great results in tracking [10,11], which makes them valuable for improving tracking results. We use 15x15 normalized intensity patches as the third descriptor for the object. We propose to use a pool of templates (POT), which holds a set of normalized intensity patches. To increase the speed of the tracker, we limit the number of responses in the pool of templates by setting a threshold: if the distance of a response to the POT is not high enough (i.e., it is too similar to the stored templates), it is not added to the pool. Given a detection response, we extract the features at different locations and scales to effectively describe the object.
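The 24-D color descriptor described above (an 8-bin histogram per RGB channel, concatenated) can be sketched as below; the L1 normalization is an assumption, since the paper does not state how the histogram is normalized.

```python
import numpy as np

def color_descriptor(patch_rgb, bins=8):
    """24-D color descriptor: an 8-bin histogram per RGB channel,
    concatenated and L1-normalized.  `patch_rgb` is an HxWx3 uint8
    array cropped from a detection response region."""
    feats = []
    for c in range(3):
        hist, _ = np.histogram(patch_rgb[:, :, c], bins=bins, range=(0, 256))
        feats.append(hist)
    v = np.concatenate(feats).astype(float)
    return v / v.sum()
```

A uniformly black patch puts all mass in the first bin of each channel, giving a 24-D vector that sums to one.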

In short, the appearance descriptor of a response over a region R can be summarized by the formula:

F_R = ( c_R, h_R, t )    (4)

where c_R and h_R are the color histogram and histogram of gradients at region R, and t is the normalized intensity patch. In our implementation, we choose 8 regions for the color histogram and HOG. For the normalized intensity patches, only the torso part is chosen, as it is suitable for template-based matching. Thus, the combination of these features gives us a total of 17 cues for the feature pool.

Note that we do not extract local descriptors for detection responses whose visibility ratio is less than a certain threshold, based on an occupancy map [18].

2) Sample Descriptors

We use normalized cross-correlation (NCC) as the measurement to overcome differences in lighting and exposure conditions. The similarity of two sample descriptors is defined as follows:

d(f_i, f_j) = 1 − NCC(f_i, f_j)    (5)

where f_i and f_j are corresponding elements (cues) of the two descriptors.
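A minimal sketch of this NCC-based distance follows; mean-and-variance normalization is what makes it robust to the lighting and exposure changes mentioned above.

```python
import math

def ncc(x, y):
    """Normalized cross-correlation between two equal-length descriptors:
    subtract each mean and divide by each standard deviation, so any
    affine change of intensity (gain/offset) leaves the score unchanged."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

def ncc_distance(x, y):
    """Distance per equation (5): 0 for patches identical up to an affine
    intensity change, 2 for perfectly anti-correlated patches."""
    return 1.0 - ncc(x, y)
```

For example, a descriptor and its doubled copy have distance 0, while a descriptor and its reversal have distance 2.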

When comparing image texture, the elements in the pools of templates of the two targets are compared one by one using the above measurement, and the maximum distance is selected in the end. We denote the distance between appearance descriptors over region R as d_c^R (color), d_h^R (HOG), and d_t (intensity template). Then the feature vector is defined as:

f = [ d_c^{R_1}, ..., d_c^{R_8}, d_h^{R_1}, ..., d_h^{R_8}, d_t ]

where R_1, ..., R_8 are the chosen feature regions.

This feature vector gives us a feature pool, so we can use the AdaBoost algorithm to combine these cues into a strong classifier.
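The boosting step can be sketched with a tiny AdaBoost over threshold stumps; the stump weak learner and the number of rounds are illustrative choices, not details from the paper, but the reweighting scheme is standard AdaBoost.

```python
import math

def train_adaboost(samples, labels, rounds=10):
    """Tiny AdaBoost over threshold stumps on distance-cue vectors: each
    weak learner picks one cue, a threshold, and a polarity; boosting
    reweights samples so later stumps focus on the hard pairs.  Labels
    are +1 (same person) / -1 (different people).  Returns a list of
    (cue, threshold, polarity, alpha) weak learners."""
    n, d = len(samples), len(samples[0])
    w = [1.0 / n] * n
    learners = []
    for _ in range(rounds):
        best = None  # (weighted error, cue, threshold, polarity)
        for cue in range(d):
            for thresh in sorted({s[cue] for s in samples}):
                for pol in (1, -1):
                    err = sum(wi for wi, s, y in zip(w, samples, labels)
                              if (pol if s[cue] <= thresh else -pol) != y)
                    if best is None or err < best[0]:
                        best = (err, cue, thresh, pol)
        err, cue, thresh, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        learners.append((cue, thresh, pol, alpha))
        # Reweight: misclassified samples gain weight, then renormalize.
        w = [wi * math.exp(-alpha * y * (pol if s[cue] <= thresh else -pol))
             for wi, s, y in zip(w, samples, labels)]
        z = sum(w)
        w = [wi / z for wi in w]
    return learners

def predict(learners, s):
    """Strong classifier: sign of the alpha-weighted vote of the stumps."""
    score = sum(a * (p if s[c] <= t else -p) for c, t, p, a in learners)
    return 1 if score >= 0 else -1
```

On a toy one-cue problem where small distances mean "same person", the strong classifier separates the two sides of the threshold.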

IV. EXPERIMENTAL STUDIES

We tested the accuracy of our people counting system in both simple and complex scenarios. These scenarios include pedestrians moving at either slow or very high speeds, and several pedestrian interactions, such as pedestrians occluding one another, passing each other, or gathering and moving in different directions.

The experiments are carried out on several scenarios, including indoor and outdoor video sequences. CAVIAR [8] is a popular public dataset captured in a shopping center using a fixed camera. For the outdoor environment, we test on the PETS 2009 dataset [21] and on video sequences that we captured on a university campus to see how the system performs in a real scenario. These datasets are very challenging because of heavy occlusion and the complicated motion of pedestrians. Moreover, the resolution, lighting conditions, and camera views differ across the datasets.

A. Tracking Evaluation

We evaluate both tracking and counting performance. For tracking evaluation, we use the common metrics proposed in [12]. The evaluation results are presented in Table I. Our result has the fewest ID switches (IDS), driven by a low mostly-lost (ML) rate and a relatively high mostly-tracked (MT) rate. As shown in Table I, our ML rate is the lowest of all, and our MT rate is better than most of the previous methods. Note that our experiments use a different detector than these methods, so the results are only relative to the performance of the detector. Some sample results in both indoor and outdoor environments are shown in Figure 2.

Our tracker is implemented based on the tracker of [5], so we compare against it to get an objective view of how our tracking method improves. The experiment is conducted only on challenging video sequences with heavy false alarms, missed detections, and occlusions. For a fair comparison, we use our detector and our choices of feature regions in both implementations. The results are shown in Table II. The first row represents the result using only the color histogram. Even though its PT rate is the best, the overall performance in this case is not high because many ID switches happen when objects share similar appearance. Our method shows slightly better performance than PIRMPT, while both share the same small number of ID switches.

B. Counting Evaluation

To evaluate counting accuracy, we use standard precision and recall measurements. The results are shown in Table III. The first two columns give the true numbers of people entering and exiting the region of interest; the next two columns give the numbers counted by the system. The table shows that our system achieves very high accuracy in both indoor and outdoor experiments, across many different camera views and lighting conditions.
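The precision and recall in Table III follow the standard definitions over true positives, false negatives, and false positives; a minimal sketch:

```python
def counting_metrics(tp, fn, fp):
    """Standard counting metrics: precision penalizes spurious counts
    (false positives), recall penalizes missed people (false negatives).
    Values are rounded to two decimals, as reported in Table III."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return round(precision, 2), round(recall, 2)
```

For example, the CAVIAR row of Table III (TP=213, FN=19, FP=14) yields precision 0.94 and recall 0.92.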

V. CONCLUSIONS

We have presented an automatic video monitoring system that counts people by tracking. Our framework takes full advantage of a pedestrian detector and a pedestrian tracker to enhance the counting results. Unlike other monitoring systems that deal only with entry and exit points, our method can be applied to large open areas, which enables a large variety of


TABLE I. TRACKING RESULT ON CAVIAR DATASET

Method             FAPF   GT   MT     PT     ML    IDS
Wu et al. [18]     0.281  140  75.7%  17.9%  6.4%  17
Zhang et al. [19]  0.105  140  85.7%  10.7%  3.6%  15
Xing et al. [20]   0.136  140  84.3%  12.1%  3.6%  14
Huang et al. [6]   0.186  143  78.3%  14.7%  7.0%  12
Ours               0.113  140  85.3%  13.2%  1.5%  9

TABLE II. COMPARISON WITH OUR RE-IMPLEMENTATION OF [5]

Method         GT  MT  PT  ML  FAPF   IDS
Color          49  40  4   5   0.197  4
PIRMPT (*)     49  42  3   4   0.185  3
Color+HOG+POT  49  43  2   4   0.181  3

applications. We also proposed a new appearance cue that improves a state-of-the-art tracking algorithm and gives us very high accuracy. In the future, we would like to explore people's behaviors to provide more information for our video monitoring system.

ACKNOWLEDGMENT

This paper was funded by the Advanced Program in Computer Science, University of Science, Vietnam National University-Ho Chi Minh.

REFERENCES

[1] X. Liu, P. H. Tu, J. Rittscher, A. Perera, and N. Krahnstoever, "Detecting and counting people in surveillance applications", in IEEE Conf. on Advanced Video and Signal Based Surveillance, Italy, pp. 306-311, Sep. 2005.

[2] A. Albiol, "Real-time high density people counter using morphological tools", in IEEE Int. Conf. on Computer Vision and Pattern Recognition, pp. 594-601, June 2006.

[3] D. B. Yang, H. H. Gonzalez-Banos, and L. J. Guibas, "Counting people in crowds with a real-time network of simple image sensors", in Proc. 9th Int. Conf. on Computer Vision, pp. 122-129, 2003.

[4] C. Huang and R. Nevatia, "High performance object detection by collaborative learning of joint ranking of granules features", in CVPR, 2010.

[5] C.-H. Kuo and R. Nevatia, "How does person identity recognition help multi-person tracking?", in CVPR, 2011.

[6] C. Huang, B. Wu, and R. Nevatia, "Robust object tracking by hierarchical association of detection responses", in ECCV, 2008.

[7] C.-H. Kuo, C. Huang, and R. Nevatia, "Multi-target tracking by on-line learned discriminative appearance models", in CVPR, 2010.

[8] CAVIAR dataset. http://homepages.inf.ed.ac.uk/rbf/CAVIAR/

[9] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection", in CVPR, 2005.

[10] Z. Kalal, J. Matas, and K. Mikolajczyk, "P-N learning: Bootstrapping binary classifiers by structural constraints", in CVPR, 2010.

[11] T. Dinh, N. Vo, and G. Medioni, "Context tracker: Exploring supporters and distracters in unconstrained environments", in CVPR, 2011.

[12] Y. Li, C. Huang, and R. Nevatia, "Learning to associate: HybridBoosted multi-target tracker for crowded scene", in CVPR, 2009.

[13] L. E. Aik and Z. Zainuddin, "Real-time people counting system using curve analysis method", in IJCEE, pp. 77-83, 2009.

[14] B.-R. Son, S.-C. Shin, J.-G. Kim, and Y.-S. Her, "Implementation of the real-time people counting system using wireless sensor networks", in International Journal of Multimedia and Ubiquitous Engineering, vol. 2, no. 3, pp. 63-79, July 2007.

[15] C. Kutschera, M. Horauer, M. Ray, D. Steinmair, and P. Gorski, "A flexible sensor-mat to automate the process of people counting", in International Conference on Advances in Circuits, Electronics and Micro-electronics, 2011.

[16] C.-C. Chen, H.-H. Lin, and O. T.-C. Chen, "Tracking and counting people in visual surveillance systems", in ICASSP, 2011.

[17] D. Fehr, R. Sivalingam, V. Morellas, N. Papanikolopoulos, O. Lotfallah, and Y. Park, "Counting people in groups", in AVSS, 2009.

[18] B. Wu and R. Nevatia, "Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detections", International Journal of Computer Vision (IJCV), 75(2):247-266, November 2007.

[19] L. Zhang, Y. Li, and R. Nevatia, "Global data association for multi-object tracking using network flows", in CVPR, 2008.

[20] J. Xing, H. Ai, and S. Lao, "Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses", in CVPR, 2009.

[21] PETS 2009 dataset. http://www.cvg.rdg.ac.uk/PETS2009

TABLE III. COUNTING RESULT ON DIFFERENT DATASETS

Dataset  Real enter  Real exit  System enter  System exit  TP   FN  FP  Precision  Recall
CAVIAR   140         92         138           89           213  19  14  0.94       0.92
PETS     49          40         47            39           82   7   4   0.95       0.91
Campus   37          33         37            36           68   2   5   0.93       0.97

Figure 2. Tracking and counting result in different environments

