CNN-based Crowd Counting Methods

Tannaz R.Damavandi Elinor Huntington

Introduction

Crowd counting has a wide range of applications that cross the boundaries of science and engineering, such as:

● Geopolitical and civic applications

● Crowd control and public safety

● Transportation systems design and traffic control

● Counting cells or bacteria on the microscopic level

Image source : http://www.robots.ox.ac.uk/~vgg/projects/seebibyte//images/Counting3.jpg

Image source : https://i.kinja-img.com/gawker-media/image/upload/s--PgpCmwTr--/c_scale,fl_progressive,q_80,w_800/ezbhvc4qy5vdeebfwcgx.jpg

Introduction (Cont’d)

Crowd counting is a challenging task: a method has to cope with factors such as occlusion between people and the visual similarity between background features and the faces in a crowd.

Image source: ShanghaiTech dataset

Background

Herbert Jacobs’s method (1967)

Crowd Count = Avg. number of people in a section * Number of sections

Drawback:

● Crowds not distributed uniformly

Solution:

● Estimate the count for each patch and add all these estimates together.
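The difference between the two estimates can be sketched in a few lines. This is only an illustration: the function names and the per-section head counts are hypothetical, not taken from the slides.

```python
def jacobs_estimate(sample_counts, num_sections):
    """Average people per sampled section times the number of sections."""
    avg_per_section = sum(sample_counts) / len(sample_counts)
    return avg_per_section * num_sections


def patchwise_estimate(patch_counts):
    """Estimate every patch separately and add the estimates together."""
    return sum(patch_counts)


# Hypothetical, non-uniform crowd split into 6 sections.
all_sections = [40, 35, 5, 2, 60, 8]

# Averaging over only the first two sections and extrapolating.
uniform_guess = jacobs_estimate(all_sections[:2], len(all_sections))

# Summing a separate estimate for every section.
patch_total = patchwise_estimate(all_sections)

print(uniform_guess, patch_total)
```

With a non-uniform crowd, the extrapolated figure depends heavily on which sections are sampled, which is the drawback the per-patch sum avoids.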

Image source : https://airphotoslive.com//wp-content/uploads/2013/04/crowd-counting-02.jpg

Background (Cont’d)

Most proposed automated crowd counting models cannot handle large crowds, especially when the number of people exceeds hundreds of thousands.

Three main crowd counting methods:

● Pixel-based analysis

○ Edge information and density maps (Zhang et al. Model, MCNN, SCNN)

● Texture-based analysis

○ Fourier analysis

● Object level analysis

○ Locate individuals in a scene

Cross-scene crowd counting via deep convolutional neural networks (Zhang et al. Model) (2015)

This model is the precursor to MCNN and SCNN.

Model:

● 3 convolutional layers

● 3 fully connected layers

● 2 max pooling layers with a 2 × 2 kernel size

● Activation function: ReLU

The WorldExpo’10 crowd counting dataset was first introduced by Zhang et al. It contains 1132 annotated video sequences captured by 108 surveillance cameras, all from the Shanghai 2010 World Expo.

Zhang et al. Model (Cont’d)

MCNN (Multi-Column Convolutional Neural Network)

Two natural configurations for crowd counting with CNNs:

1- Direct headcount

2- Density map of the crowd

MCNN adopts the second configuration.

Advantages:

● Features learned by each column adapt to variations in people/head size caused by perspective effects or image resolution.

● The true density map is computed accurately with geometry-adaptive kernels, which do not require the perspective map of the input image.

MCNN

Model:

● 3 parallel CNNs with local receptive fields of different sizes

● 2 max pooling layers, each applied over 2 × 2 regions

● Activation function: rectified linear unit (ReLU)
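To illustrate the pooling step, here is a minimal sketch of non-overlapping 2 × 2 max pooling on a small feature map. It assumes a stride of 2 and even dimensions (neither is stated on the slide), and the function name and values are hypothetical.

```python
def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling on a 2D list (even height and width assumed)."""
    pooled = []
    for r in range(0, len(feature_map), 2):
        row = []
        for c in range(0, len(feature_map[0]), 2):
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            row.append(max(window))
        pooled.append(row)
    return pooled


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
once = max_pool_2x2(fmap)   # 4x4 -> 2x2
twice = max_pool_2x2(once)  # 2x2 -> 1x1
print(once, twice)
```

Under the stride-2 assumption, each pooling layer halves the height and width, so two of them leave a map with a quarter of the input resolution along each side.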

Data sets

Table 1 - Comparison of the ShanghaiTech dataset with existing datasets. Num is the number of images; Max is the maximal crowd count; Min is the minimal crowd count; Ave is the average crowd count; Total is the total number of labeled people.

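MCNN: Density Map via Geometry-Adaptive Kernels

● Goal: accurate estimation of the crowd density

● The density is distorted by the homography between the ground plane and the image plane

● The geometry of the scene is generally not known

● Assumption: the crowd is distributed uniformly around each head

● The average distance to the k nearest neighbors (KNN) of each head then sets the kernel size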

Original images and corresponding crowd density maps obtained by convolving geometry-adaptive Gaussian kernels.
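A minimal sketch of the geometry-adaptive idea: each annotated head contributes one Gaussian whose spread is proportional to the average distance to its k nearest neighboring heads. The values `k=3` and `beta=0.3`, the function names, the head coordinates, and the normalization over a finite grid are all assumptions made for illustration.

```python
import math


def knn_mean_distance(points, index, k):
    """Mean distance from points[index] to its k nearest other points."""
    px, py = points[index]
    dists = sorted(
        math.hypot(px - qx, py - qy)
        for j, (qx, qy) in enumerate(points) if j != index
    )
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


def density_map(heads, height, width, k=3, beta=0.3):
    """Sum one normalized Gaussian per head; sigma scales with local head spacing."""
    dmap = [[0.0] * width for _ in range(height)]
    for i, (hx, hy) in enumerate(heads):
        sigma = beta * knn_mean_distance(heads, i, k)
        weights = [
            [math.exp(-((x - hx) ** 2 + (y - hy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, weights))
        for y in range(height):
            for x in range(width):
                # Normalizing on the grid keeps each head's mass equal to 1.
                dmap[y][x] += weights[y][x] / total
    return dmap


heads = [(5, 5), (8, 6), (20, 18), (24, 22)]  # hypothetical (x, y) annotations
dmap = density_map(heads, height=32, width=32)
count = sum(map(sum, dmap))
print(round(count, 6))
```

Because each kernel is normalized, summing the density map recovers the number of annotated heads, which is what lets a density map stand in for a direct headcount.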

MCNN (Cont’d)

Loss function:

$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(X_i; \Theta) - F_i \right\|_2^2$$

𝛩 : the set of learnable parameters in the MCNN.

N : the number of training images.

Xi : the input image.

Fi : the ground truth density map of image Xi.

F(Xi; 𝛩) : the estimated density map generated by the MCNN, parameterized with 𝛩, for sample Xi.

L : the loss between the estimated density map and the ground truth density map.

The loss function can be optimized via batch-based stochastic gradient descent and backpropagation.
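The loss between estimated and ground truth density maps can be sketched as a squared Euclidean distance averaged over the training images. The 1/(2N) scaling is assumed here as the usual convention, and the function name and tiny 2 × 2 maps are hypothetical.

```python
def density_loss(predicted_maps, target_maps):
    """Squared Euclidean distance between density maps, scaled by 1 / (2N)."""
    n = len(predicted_maps)
    total = 0.0
    for pred, target in zip(predicted_maps, target_maps):
        for pred_row, target_row in zip(pred, target):
            total += sum((p - t) ** 2 for p, t in zip(pred_row, target_row))
    return total / (2 * n)


# Two hypothetical 2x2 predictions and their ground truth maps.
preds = [[[0.5, 0.0], [0.0, 0.5]], [[1.0, 0.0], [0.0, 0.0]]]
targets = [[[0.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 0.0]]]
loss = density_loss(preds, targets)
print(loss)
```

In a real training loop this value would be computed per mini-batch by the deep learning framework, which also supplies the gradients for backpropagation.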

SCNN

● Switching Convolutional Neural Network

○ 3 small CNNs (aka Regressors)

○ 1 VGG16-based switch

● Images are split into patches

● Each patch is processed by the switch and sent to one of the Regressors

● Output is a density map

https://arxiv.org/pdf/1708.00199.pdf

Data

● Each input image is split into 9 smaller patches

● During training, the ground truth is transformed into a density map so that it can be compared with the model output
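The patching step might look like the sketch below. It assumes the 9 patches form a 3 × 3 grid of non-overlapping tiles, which the slide does not state explicitly; the function name and the toy image are hypothetical.

```python
def split_into_patches(image, grid=3):
    """Split a 2D list into grid x grid non-overlapping patches, row-major order."""
    height, width = len(image), len(image[0])
    ph, pw = height // grid, width // grid  # any remainder rows/columns are dropped
    patches = []
    for gr in range(grid):
        for gc in range(grid):
            rows = image[gr * ph:(gr + 1) * ph]
            patches.append([row[gc * pw:(gc + 1) * pw] for row in rows])
    return patches


# Toy 6x6 "image" whose pixel values encode their position.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
patches = split_into_patches(image)
print(len(patches), patches[0], patches[8])
```

The same split applied to the ground truth density map gives the per-patch target that each patch's output is compared against.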

SCNN Regressors

● Based on the MCNN Regressor architecture

● Four convolutional layers, 2 max pooling layers, and a final 1 × 1 layer to transform features into a density map

SCNN Regressors (Cont’d)

● Each Regressor has a different receptive field, suited to a different crowd density

● Mean inter-head distance is used as a proxy for crowd density

SCNN Switch

● The first 5 convolutional / max pooling layers are the same as in VGG16

● Followed by a Global Average Pooling (GAP) layer and 2 fully connected layers

○ Similar to the final stages of ResNet-50

○ GAP reduces overfitting

● Finally, a softmax classifies the image patch to a Regressor
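The final routing step can be sketched as a softmax over three switch scores followed by picking the most probable Regressor. The score values and function names are hypothetical; in the real model the scores come from the fully connected layers of the switch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_patch(switch_logits):
    """Return the index of the most probable Regressor and all probabilities."""
    probs = softmax(switch_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical switch outputs for one patch, one score per Regressor.
regressor_index, probs = route_patch([0.2, 2.1, -1.0])
print(regressor_index, [round(p, 3) for p in probs])
```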

SCNN Algorithm

● There are 3 main training stages

○ Pretraining

○ Differential training

○ Coupled training

Regressor Pretraining

● The 3 Regressors are each pretrained on the full training dataset to learn initial features that will be fine-tuned in later stages.

● Uses Least Squares Error (LSE / L2-norm) to minimize the Euclidean distance between the Regressor output and the given density map.

Loss equation (figure labels): the number of training samples, the Regressor output, and the ground truth density map for the given training sample.

Differential Training

● Backpropagation is done with the same L2-norm loss on density maps as in pretraining.

● However, the choice of which Regressor to backpropagate on is determined by count error: for each patch, only the Regressor with the lowest count error is updated.
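One reading of this selection rule is sketched below: every Regressor predicts a density map for the patch, the counts are compared with the ground truth count, and the Regressor with the smallest absolute error is the one chosen for backpropagation. Picking the minimum absolute error is an assumption, and the function names and maps are hypothetical.

```python
def map_count(density):
    """The crowd count of a density map is the sum of all its values."""
    return sum(map(sum, density))


def pick_regressor(predicted_maps, ground_truth_map):
    """Index of the Regressor whose predicted count is closest to the true count."""
    true_count = map_count(ground_truth_map)
    errors = [abs(map_count(pred) - true_count) for pred in predicted_maps]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors


ground_truth = [[0.5, 0.5], [1.0, 0.0]]   # true count 2.0
predictions = [
    [[1.0, 1.0], [1.0, 1.0]],             # Regressor 0 predicts 4.0
    [[0.5, 0.5], [0.5, 0.25]],            # Regressor 1 predicts 1.75
    [[0.0, 0.0], [0.5, 0.0]],             # Regressor 2 predicts 0.5
]
best, errors = pick_regressor(predictions, ground_truth)
print(best, errors)
```

The chosen index can also serve as the training label for the switch, since it marks which Regressor handled that patch best.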

Coupled Training

● In each epoch, alternate between training the switch and backpropagating on the chosen Regressor

● This way, the Regressors and the switch are co-adapted to the training input

Evaluation

Method          Part A            Part B
                MAE      MSE      MAE      MSE
Zhang et al.    181.8    277.7    32.0     49.8
MCNN            110.2    173.2    26.4     41.3
SCNN            90.4     135.0    21.6     33.4
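For reference, the two metrics can be computed from per-image counts as sketched below. This assumes that MSE here follows the common crowd counting convention of reporting the square root of the mean squared count error; the function names and counts are hypothetical and unrelated to the table values.

```python
import math


def mae(predicted, actual):
    """Mean absolute error between predicted and true counts."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def root_mse(predicted, actual):
    """Square root of the mean squared count error (assumed meaning of 'MSE')."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


# Hypothetical per-image counts for three test images.
predicted_counts = [105.0, 48.0, 310.0]
actual_counts = [100.0, 52.0, 300.0]

mae_value = mae(predicted_counts, actual_counts)
mse_value = root_mse(predicted_counts, actual_counts)
print(round(mae_value, 3), round(mse_value, 3))
```

MAE reflects average accuracy, while the squared-error metric penalizes large misses more heavily and so says more about robustness.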

DEMO

https://drive.google.com/file/d/1WmiNALhkC4pHHCIPtphpv13V2sc9UqQu/view

https://drive.google.com/file/d/1FNEItvmHUzwmP3odvxSBDfj6t7_PG0h5/view

https://drive.google.com/file/d/1XhRz11ZKy66umj9u0ixCeU-uRGELq33N/view

https://drive.google.com/file/d/1oZwY-32joIjUdE5bpA4YFVIPi3kGdOsP/view

https://drive.google.com/file/d/1AdAcFS5JojFlWtySPKBHU-RoBaugnbjH/view

https://drive.google.com/file/d/1DSJ0arSNIQ1Hr_zsDfQTZKyGClPhluMA/view

https://drive.google.com/file/d/1h6vTijiXGH9Lk6BiiDBg6n_IbOtwU63o/view

https://drive.google.com/file/d/1rq-tjkESXSXH5jWmHlGHvrYHqDOJqXyT/view

https://drive.google.com/file/d/1-L0QN7oGsSFvXt0nW1pSs1UdNqpba5Bc/view

https://drive.google.com/file/d/1LfSo1NatITeYEFN1fZf6nR0h5k0phhog/view

https://drive.google.com/file/d/170oXifnspGJsjJWv3q0KGHo8v7kF_3uC/view

https://drive.google.com/file/d/1aypZc8cgnx21vVAdJFn2Kq9N0dGL3TMW/view

Conclusions and Further Work

● This method can produce good density estimates, but it almost always undercounts the actual people unless it also picks up some other object, like trees, flags, open sky…

● To be truly useful, it would probably have to be trained on given perspectives so that it could exclude non-human objects from its recognition.

○ This could work in a security scenario with fixed video perspectives, but creating the ground truths would require a lot of work.

● Future work to refine this model:

○ Modify the switch architecture

○ For live video input, use a different algorithm that chooses the Regressor beforehand, and measure the impact on counts

○ Examine the difference in counts between whole input images and patched ones

References

[1] B. A. Bansal and K. Venkatesh. People counting in high density crowds from still images. 2015.

[2] D. B. Sam, S. Surya, and R. V. Babu. Switching convolutional neural network for crowd counting. CoRR, abs/1708.00199, August 2017.

[3] D. Ryan, S. Denman, S. Sridharan, and C. B. Fookes. An evaluation of crowd counting methods, features and regression models. Computer Vision and Image Understanding, 130, pp. 1-17, 2015.

[4] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single image crowd counting via multi-column convolutional neural network. In CVPR, IEEE, 10.1109/CVPR.2016.70, June 2016.

[5] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In CVPR, 2015.

[6] R. Goodier. The Curious Science of Counting a Crowd. Popular Mechanics, 2011. [online] Available at: http://www.popularmechanics.com/science/a7121/the-curious-science-of-counting-a-crowd/ [Accessed 25 Nov. 2017]