CNN-based Crowd Counting Methods
Tannaz R. Damavandi, Elinor Huntington
Introduction
Crowd counting has a wide range of applications across science and engineering, such as:
● Geopolitical and civic applications
● Crowd control and public safety
● Transportation systems design and traffic control
● Counting cells or bacteria on the microscopic level
Image source: http://www.robots.ox.ac.uk/~vgg/projects/seebibyte//images/Counting3.jpg
Image source: https://i.kinja-img.com/gawker-media/image/upload/s--PgpCmwTr--/c_scale,fl_progressive,q_80,w_800/ezbhvc4qy5vdeebfwcgx.jpg
Introduction (Cont’d)
Crowd counting is challenging: a method must cope with occlusion between people and with
background features that resemble the faces in the crowd.
Image source: ShanghaiTech dataset
Background
Herbert Jacobs' method (1967)
Crowd Count = Avg. number of people in a section * Number of sections
Drawback:
● Crowds are not distributed uniformly
Solution:
● Estimate the count for each patch and add all
these estimates together.
Image source : https://airphotoslive.com//wp-content/uploads/2013/04/crowd-counting-02.jpg
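The contrast between the uniform estimate and the per-patch sum can be sketched as follows. This is a minimal illustration only: the patch counts are made-up numbers, and the function names are hypothetical, not taken from any of the cited methods.

```python
def jacobs_estimate(avg_people_per_section, num_sections):
    """Uniform estimate: one average density applied to every section."""
    return avg_people_per_section * num_sections


def patchwise_estimate(patch_counts):
    """Non-uniform estimate: add up a separate count for every patch."""
    return sum(patch_counts)


# Hypothetical head counts for six equally sized patches of one scene.
patch_counts = [2, 3, 40, 55, 4, 1]

# Averaging over only the first two (sparse) patches misrepresents the scene.
sampled = patch_counts[:2]
uniform = jacobs_estimate(sum(sampled) / len(sampled), len(patch_counts))
per_patch = patchwise_estimate(patch_counts)

print(uniform, per_patch)
```

When the crowd is unevenly distributed, the uniform estimate depends heavily on which sections were sampled, while the per-patch sum does not.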
Background (Cont’d)
Most proposed automated crowd counting models cannot handle very large crowds,
especially crowds of hundreds of thousands of people or more.
Three main crowd counting methods:
● Pixel-based analysis
○ Edge information and density maps
(Zhang et al. model, MCNN, SCNN)
● Texture-based analysis
○ Fourier analysis
● Object level analysis
○ Locate individuals in a scene
Cross-scene crowd counting via deep convolutional neural
networks (Zhang et al. Model)(2015)
This model is the precursor to MCNN and SCNN.
Model: ● 3 convolutional layers.
● 3 fully connected layers.
● 2 max pooling layers with a 2 × 2 kernel size.
● Activation function: ReLU
The WorldExpo’10 crowd counting dataset was introduced by Zhang et al. It contains 1132 annotated video sequences
captured by 108 surveillance cameras, all from the Shanghai 2010 World Expo.
Zhang et al. Model (Cont’d)
MCNN (Multi-Column Convolutional Neural Network)
Two natural configurations for crowd counting with CNNs:
1- Direct head count
2- Density map of the crowd
MCNN adopts the second approach.
Advantages:
● Features learned by each column are adaptive to variations in people/head size due to perspective
effect or image resolution.
● True density map is computed accurately based on geometry-adaptive kernels which do not need
to know the perspective map of the input image.
MCNN
Model: ● 3 parallel CNNs with local receptive fields of different sizes
● 2 max pooling layers, each applied over 2×2 regions.
● Activation function: rectified linear unit (ReLU)
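As a rough illustration of the two non-convolutional building blocks listed above, the sketch below applies ReLU and 2×2 max pooling to a small feature map. It assumes non-overlapping pooling windows (stride 2), so two pooling stages reduce each spatial dimension by a factor of 4. The feature map values are made up.

```python
def relu(feature_map):
    """Rectified linear unit: negative activations are clamped to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling (stride 2 is an assumption here)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


# Made-up 4x4 feature map.
feature_map = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -2.0, 1.5, 0.0],
    [0.0, 0.0, -1.0, -2.0],
    [1.0, 3.0, -4.0, -0.5],
]

pooled_once = max_pool_2x2(relu(feature_map))   # 4x4 -> 2x2
pooled_twice = max_pool_2x2(pooled_once)        # 2x2 -> 1x1
print(pooled_once, pooled_twice)
```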
Datasets
Table 1 - Comparison of the ShanghaiTech dataset with existing datasets: Num is the number of images; Max is the maximal crowd count;
Min is the minimal crowd count; Ave is the average crowd count; Total is the total number of labeled people.
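MCNN: Density Map via Geometry-Adaptive Kernels
● Goal: accurate estimation of the crowd density
● The density must account for the homography between the ground plane and the image plane
● The geometry of the scene is generally unknown
● Assumption: the crowd is distributed uniformly around each head
● Kernel spread is set from the average distance to the k nearest neighbors (KNN)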
Original images and corresponding crowd density maps obtained by convolving geometry-adaptive Gaussian kernels.
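A minimal sketch of a geometry-adaptive density map follows. It assumes each annotated head is replaced by a Gaussian whose spread is `beta` times the mean distance to its `k` nearest neighbouring heads; the values `beta=0.3` and `k=3` and the head coordinates are illustrative assumptions. Each kernel is normalized over the image so that the map sums to the head count.

```python
import math


def mean_knn_distance(heads, index, k=3):
    """Mean distance from one head to its k nearest neighbouring heads.

    Requires at least two annotated heads.
    """
    x0, y0 = heads[index]
    dists = sorted(
        math.hypot(x - x0, y - y0)
        for j, (x, y) in enumerate(heads) if j != index
    )
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def density_map(heads, height, width, beta=0.3, k=3):
    """Sum of per-head Gaussians with sigma = beta * mean k-NN distance."""
    dmap = [[0.0] * width for _ in range(height)]
    for i, (hx, hy) in enumerate(heads):
        sigma = beta * mean_knn_distance(heads, i, k)
        kernel = [
            [math.exp(-((x - hx) ** 2 + (y - hy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, kernel))  # normalize so each head contributes 1
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / total
    return dmap


# Hypothetical (x, y) head annotations in a 32x24 image.
heads = [(5, 5), (7, 5), (6, 8), (25, 20), (15, 12)]
dmap = density_map(heads, height=24, width=32)
total_count = sum(map(sum, dmap))
print(round(total_count, 6))
```

Heads in dense regions get narrow kernels and isolated heads get wide ones, which is how the kernel adapts to perspective without a perspective map.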
MCNN (Cont’d)
Loss function:
L(𝛩) = 1/(2N) · Σ_{i=1..N} ‖F(Xi; 𝛩) − Fi‖₂²
𝛩: the set of learnable parameters in the MCNN.
N: number of training images.
Xi: input image.
Fi: the ground truth density map of image Xi.
F(Xi; 𝛩): estimated density map generated by the MCNN,
parameterized by 𝛩, for sample Xi.
L: loss between the estimated density map and the ground truth density map.
The loss function can be optimized via batch-based stochastic gradient descent and
backpropagation.
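The loss defined by the symbols above can be sketched in a few lines. This assumes the squared Euclidean form with a 1/(2N) factor, i.e. the summed squared pixel difference between estimated and ground truth density maps, averaged over N images and halved. The tiny 2×2 maps are made up.

```python
def density_map_loss(estimated_maps, ground_truth_maps):
    """L = 1/(2N) * sum_i || F(X_i) - F_i ||_2^2 over N density map pairs."""
    n = len(estimated_maps)
    total = 0.0
    for est, gt in zip(estimated_maps, ground_truth_maps):
        total += sum(
            (e - g) ** 2
            for est_row, gt_row in zip(est, gt)
            for e, g in zip(est_row, gt_row)
        )
    return total / (2 * n)


# Two made-up 2x2 density maps and their ground truths.
estimated = [
    [[0.5, 0.0], [0.0, 0.5]],
    [[1.0, 0.0], [0.0, 0.0]],
]
ground_truth = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 0.0]],
]

loss = density_map_loss(estimated, ground_truth)
print(loss)
```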
SCNN
● Switching Convolutional Neural
Network
○ 3 small CNNs (aka Regressors)
○ 1 VGG16-based switch
● Images are split into patches
● Each patch is processed by the
switch and sent to one of the
Regressors
● Output is a density map
https://arxiv.org/pdf/1708.00199.pdf
Data
● Each input image is
split into 9 smaller
patches
● During training, the ground
truth annotation is converted
into a density map for
comparison with the model output
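Splitting an image into 9 patches can be sketched as below, assuming a 3×3 grid of non-overlapping patches with any remainder pixels dropped (the grid layout is an assumption; the slide only states the number of patches). The same function can be applied to a ground truth density map, and patch counts then add up to the image count.

```python
def split_into_patches(image, grid=3):
    """Split a 2D image (list of rows) into grid x grid non-overlapping patches."""
    height, width = len(image), len(image[0])
    ph, pw = height // grid, width // grid  # remainder rows/columns are dropped
    patches = []
    for gr in range(grid):
        for gc in range(grid):
            patch = [
                row[gc * pw:(gc + 1) * pw]
                for row in image[gr * ph:(gr + 1) * ph]
            ]
            patches.append(patch)
    return patches


# Made-up 6x9 "image" whose pixel value encodes its position.
image = [[r * 9 + c for c in range(9)] for r in range(6)]
patches = split_into_patches(image)

print(len(patches), patches[0], patches[8])
```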
SCNN Regressors
● Based on the MCNN
Regressor architecture
● Four convolutional layers, 2
max pooling layers, and a
final 1 x 1 layer to
transform features into a
density map.
SCNN Regressors (Cont’d)
● Each regressor has a
different receptive field that
evaluates crowd density.
● Uses mean inter-head
distance as a proxy for
crowd density.
SCNN Switch
● First 5 convolutional / max pooling
layers the same as VGG16
● Followed by Global Average Pooling
layer (GAP) and 2 fully connected layers
○ Similar to the final stages of
ResNet-50
○ GAP reduces overfitting
● Finally, a softmax layer assigns the image
patch to one of the regressors
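The final routing step of the switch can be sketched as a softmax over three scores followed by an argmax. The logits below are made-up stand-ins for the output of the last fully connected layer, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_patch(switch_logits):
    """Return the index of the regressor with the highest probability."""
    probs = softmax(switch_logits)
    chosen = max(range(len(probs)), key=probs.__getitem__)
    return chosen, probs


# Hypothetical switch outputs for one patch, one score per regressor.
chosen, probs = route_patch([0.2, 2.1, -1.0])
print(chosen, [round(p, 3) for p in probs])
```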
SCNN Algorithm
● There are 3 main training
stages
○ Pretraining
○ Differential training
○ Coupled training
Regressor Pretraining
● The 3 Regressors are each
pretrained on the full
training dataset to learn
initial features that will be
fine-tuned in later stages.
● Uses Least Squares Error
(LSE / L2-norm) to
minimize the Euclidean
distance between the
Regressor output and the
given density map.
Equation labels: number of training samples; Regressor output; ground truth density map for the given training sample
Differential Training
● Backpropagation is done
with the same L2-norm loss
on density maps as in
pretraining.
● However, the choice of
which Regressor to
backpropagate on is
determined by count error.
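The selection step can be sketched as follows, assuming the regressor with the smallest absolute count error on a patch is the one that gets updated. Counts are obtained by summing each density map; the three predicted maps are made up.

```python
def count(density_map):
    """Crowd count is the sum over all pixels of a density map."""
    return sum(map(sum, density_map))


def best_regressor(predicted_maps, ground_truth_map):
    """Index of the regressor whose predicted count is closest to the truth."""
    true_count = count(ground_truth_map)
    errors = [abs(count(p) - true_count) for p in predicted_maps]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors


# Made-up ground truth (3 people) and outputs of three regressors for one patch.
ground_truth_map = [[1.0, 0.0], [0.0, 2.0]]
predicted_maps = [
    [[0.5, 0.5], [0.5, 0.5]],   # count 2.0
    [[1.0, 0.5], [0.5, 1.5]],   # count 3.5
    [[2.0, 1.0], [1.0, 1.0]],   # count 5.0
]

best, errors = best_regressor(predicted_maps, ground_truth_map)
print(best, errors)
```

Only the selected regressor would then receive the L2 density map gradient for that patch; the index can also serve as a training label for the switch.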
Coupled Training
● Alternate between training
the switch and backpropagating
through the chosen
regressor in each epoch
● This is so that the
regressors and the switch
are co-adapted to the
training input
Evaluation
                  Part A            Part B
Method         MAE     MSE       MAE     MSE
Zhang et al.   181.8   277.7     32.0    49.8
MCNN           110.2   173.2     26.4    41.3
SCNN            90.4   135.0     21.6    33.4
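The two metrics in the table can be computed as sketched below. One assumption to flag: crowd counting papers commonly report "MSE" as the square root of the mean squared count error, and the sketch follows that convention. The predicted and true counts are made-up values, not results from the table.

```python
import math


def mae(predicted_counts, true_counts):
    """Mean absolute error between predicted and true crowd counts."""
    return sum(abs(p - t) for p, t in zip(predicted_counts, true_counts)) / len(true_counts)


def mse(predicted_counts, true_counts):
    """Root of the mean squared count error (assumed crowd counting convention)."""
    squared = sum((p - t) ** 2 for p, t in zip(predicted_counts, true_counts))
    return math.sqrt(squared / len(true_counts))


# Made-up per-image counts.
true_counts = [100, 250, 40]
predicted_counts = [110, 230, 40]

mae_value = mae(predicted_counts, true_counts)
mse_value = mse(predicted_counts, true_counts)
print(mae_value, round(mse_value, 2))
```

MAE reflects average accuracy, while the squared term makes MSE more sensitive to occasional large errors.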
DEMO
Conclusions and Further Work
● This method can produce good density estimates, but it almost always undercounts people
unless the scene contains other objects that it also counts, such as trees, flags, or open sky.
● To be truly useful, it would probably have to be trained with given perspectives so that it could
exclude non-human objects from its counts.
○ This could occur in a security scenario, where you would have fixed video perspectives, but it
would require a lot of work to create ground truths.
● Future work to refine this model
○ Modify the switch architecture
○ For live video input, use a different algorithm that chooses the regressor beforehand, and measure
the impact this has on counts
○ Examine the difference in counts between whole input images and patched ones
References
[1] B. A. Bansal and K. Venkatesh. People counting in high density crowds from still images. 2015.
[2] D. B. Sam, S. Surya, and R. V. Babu. Switching convolutional neural network for crowd counting. CoRR, abs/1708.00199,
August 2017.
[3] D. Ryan, S. Denman, S. Sridharan, and C. B. Fookes. An evaluation of crowd counting methods,
features and regression models. Computer Vision and Image Understanding, 130, pp. 1-17, 2015.
[4] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single image crowd counting via multi-column convolutional neural
network. In CVPR, IEEE, doi:10.1109/CVPR.2016.70, June 2016.
[5] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In
CVPR, 2015.
[6] R. Goodier. The Curious Science of Counting a Crowd. Popular Mechanics, 2011. [online] Available at:
http://www.popularmechanics.com/science/a7121/the-curious-science-of-counting-a-crowd/ [Accessed 25 Nov. 2017]