Human Detection by Searching in 3D Space Using Camera and Scene Knowledge

Yuan Li, Bo Wu and Ram Nevatia
University of Southern California, Institute for Robotics and Intelligent Systems
Los Angeles, CA 90089-0273
{yli8|bowu|nevatia}@usc.edu
Abstract

Many existing human detection systems are based on sub-window classification, namely detection is done by enumerating rectangular sub-images in the 2D image space. The detection rate of such approaches may be affected by perspective distortion and tilted orientation of the human in images. To overcome this problem without re-training the classifier, we develop a 3D search method. A search grid is defined in the 3D scene. At each grid point a rectified sub-image is generated to approximate the orthogonal projection of the target, so that the distortion due to the camera setting is reduced. In addition, the 3D target position can be estimated from single camera data. Experiments on challenging data from the PETS 2007 and CAVIAR INRIA datasets show significantly improved detection performance of our approach compared with 2D search-based methods.
1 Introduction

As an important problem in visual surveillance, hu-
man detection aims at finding all the humans in an im-
age. Among the large variety of detection methods, de-
tection based on sub-window classification [10][6] is an
important category with many high performance repre-
sentative human detection systems [11][3][8].
A sub-window classification based human detector
represents the target appearance by a rectangular im-
age patch with a pre-defined aspect ratio. Detection is
done by enumerating all such possible sub-windows in
the 2D image space. This works well when the cam-
era is distant, perspective distortion of the target is not
strong, and the target’s orientation is upright in images.
However, different camera settings may affect detection
performance. Figure 1(a) shows an example: humans in
the left top region become undetectable simply because
of the view angle and perspective effect.
Handling this problem at the classifier level is diffi-
cult and inefficient. We approach this problem by de-
veloping a new search strategy. Assuming that camera
Figure 1. Comparison of pedestrian detection results in a scene with strong perspective effect. (a) Input image from the PETS 2007 dataset [2]. (b) Our approach. (c) Image synthesized with detected 3D positions of objects.
settings can be estimated (which should be the case for
most surveillance situations), object search is performed
in the 3D world space instead of the 2D image space. A
3D scanning grid is created to cover all possible posi-
tions of objects in the scene. At each grid point, a rec-
tified sub-image is generated to approximate the object
appearance under orthogonal projection, so as to reduce
distortion caused by camera projection. This is done by
approximating the object by an imaginary planar sur-
face facing the camera and computing the homography
between the input image coordinates and the rectified
image coordinates. The classifier is then applied to the
rectified sub-image. Figure 1(b)(c) show the detection
result using our method.
Compared with the conventional 2D search method,
the benefits of our approach include: 1) The range of
viewpoint variation that the detector can deal with is en-
larged, i.e., 3D search extends the detection ability be-
yond the training data of the classifier. 2) Contextual
knowledge such as pedestrian height and ground plane
is naturally integrated to constrain the search space and
rule out false alarms. 3) Detection results are interpreted
as 3D world coordinates, which can be used for tracking
and other further processes. 4) It can be easily combined
with any sub-window classification based detector.
2 Related work

In the past decade, intense research interest in clas-
sification based object detection has brought forward nu-
merous detection systems [10][6][11][3]. Since our em-
Figure 2. Approach overview: radial undistortion → 3D searching grid → sub-image rectification → detection → result interpretation.
phasis is not on building a classifier, we will focus on the
design of search strategies and post-processes.
To search for objects in images, the most widely
adopted way is to enumerate all possible sub-windows
in the 2D image space. To further eliminate seman-
tically meaningless false detections, post-processes are
designed to utilize scene knowledge. Assuming that ob-
jects are on a ground plane, the 2D height of an object
can be modeled as a function of its position in the image to
reduce false alarms [5]. [4] uses ground plane estima-
tion with surface orientation classification to refine de-
tection results. To reduce the search space during detec-
tion, [7] computes a homography between the head top
points and foot points and combines it with background
subtraction to get a subset of foot positions for detec-
tion. A similar approach is adopted in [9]. But the ba-
sic assumption is still that pedestrians appear upright in
the image and are viewed by a distant camera. All these
methods can reduce the false alarm rate by applying con-
straints on object size and position, but they cannot im-
prove detection rate because they do not attempt to ad-
just the input image to compensate for target appearance
variations caused by camera settings.
3 Object detection in 3D search space

Figure 2 shows the block diagram of our approach. In
the following, we describe each step in detail.
In the first step, we eliminate the radial distortion of the
input image to obtain image I; from then on we operate
only on this undistorted image I. Three
coordinate systems are involved: 2D coordinates of the
radially undistorted image frame I, 2D coordinates of the
rectified sub-image A (on which detection is performed)
and 3D coordinates in the world frame W . We use
homogeneous coordinates in I , A and W , and denote
points in them by p = ({}^{I}u, {}^{I}v, 1)^T, q = ({}^{A}u, {}^{A}v, 1)^T
and P = (x, y, z, 1)^T, respectively. Let the mapping
from W to I obtained from camera parameters be
p = MP. (1)
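As an illustration (not part of the original paper), the following minimal Python sketch applies Equation 1 to project a world point; the 3×4 matrix M and all numeric values are purely illustrative assumptions standing in for a calibrated camera.

```python
import numpy as np

# Hypothetical 3x4 projection matrix M (toy intrinsics, identity pose);
# any calibrated camera matrix could be substituted here.
M = np.array([[800.0,   0.0, 320.0, 0.0],
              [  0.0, 800.0, 240.0, 0.0],
              [  0.0,   0.0,   1.0, 0.0]])

def project(M, P_world):
    """Map a 3D world point (x, y, z) to pixel coordinates via p = M P."""
    P = np.append(P_world, 1.0)   # homogeneous world point (x, y, z, 1)^T
    p = M @ P                     # homogeneous image point
    return p[:2] / p[2]           # normalize to (Iu, Iv)

print(project(M, np.array([0.5, 0.2, 1.8])))  # -> pixel coordinates
```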
The search space is defined as a set of points in 3D
world coordinates. For humans, we assume that they
Figure 3. 3D searching grid of a scene from the PETS 2007 dataset. Left to right: original image, automatically generated searching grid, searching grid with scene knowledge. Each yellow line approximates a human standing at a grid point.
stand on a ground plane (other possible positions can
also be included according to the scene). Search is per-
formed on a discrete planar grid in 3D (Figure 3). The
3D search range can be determined automatically by
constraining visibility and the minimal size of objects in
images, or it can be refined by adding scene knowledge.
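A minimal sketch of such a ground-plane scanning grid (our illustration, not the authors' code); the extents and step size are assumptions standing in for the visibility and minimal-object-size constraints described above.

```python
import numpy as np

def ground_grid(x_range, y_range, step):
    """Enumerate candidate foot positions P_o = (x, y, 0) on the ground plane.

    x_range, y_range: (min, max) extents in world units (e.g. meters).
    Returns an (N, 3) array of 3D grid points with z = 0.
    """
    xs = np.arange(x_range[0], x_range[1] + 1e-9, step)
    ys = np.arange(y_range[0], y_range[1] + 1e-9, step)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel(), np.zeros(gx.size)], axis=1)

# Illustrative extents only; real bounds would come from the scene.
grid = ground_grid((-5.0, 5.0), (0.0, 20.0), step=0.25)
```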
3.1 Generating rectified sub-images

Human detectors based on sub-window classification
are commonly trained on fixed-size images of humans
viewed from a distant camera. To get the best detection re-
sult, the sub-images input to the classifier should also be
obtained from a similar view point as the training im-
ages. Since camera settings may vary for different sites,
images need to be rectified before being input for classifica-
tion.
Given the camera position at Pc, at each point Po of
the 3D searching grid, we generate a rectified sub-image
A so that if a human stands at Po, its appearance in A approximates its appearance under an orthogonal projec-
tion. To achieve this, we define the rectified image plane
A as parallel to z axis and perpendicular to PcPo’s pro-
jection on the x-y plane. Figure 4 shows an overview
of the relationship among these geometric entities: (a)
shows the scene in 3D, and image I is shown in more
detail in (b) and (c). The human is depicted as a cylinder
for illustration. The visible part of the cylinder's side
surface in I is marked in blue.
When the camera is not very close, the orthogonal
view of the cylinder can be obtained by warping the blue
region in I . To avoid non-linear warping, we simplify
the model by assuming that the human is a rectangular
Figure 4. Relationship among image I, rectified sub-image A and an object in world frame W.
surface H (P1P2P3P4) parallel to A with its bottom
center located at Po. The projection of P1P2P3P4 in
image I (p1p2p3p4 in Figure 4(c)) is a reasonable ap-
proximation of the blue region in Figure 4 (b) if the angle
between PcPo and the x-y plane is not very large; if the
camera is over the top of the object, it would be impos-
sible to recover the object’s frontal orthogonal view.
Under this imaginary object plane approximation, the
object’s projection in image I can be transformed to the
approximate orthogonal view in the rectified image A as
follows. The angle between A and the x-z plane in the
world frame can be computed by
\theta = \frac{\pi}{2} - \arccos\left( \frac{(x_c, y_c)(x_o, y_o)^T}{|(x_c, y_c)|\,|(x_o, y_o)|} \right), \qquad (2)

where P_c = (x_c, y_c, z_c, 1)^T is the camera position (estimated from M if not known directly) and P_o = (x_o, y_o, 0, 1)^T is the search grid point.
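In code, Equation 2 is a one-liner. The sketch below is a direct transcription of that formula (with a numerical clamp added for safety); it is our illustration, not the authors' implementation.

```python
import numpy as np

def rectification_angle(P_c, P_o):
    """Angle theta between plane A and the x-z plane (Equation 2).

    P_c: camera position (x_c, y_c, z_c); P_o: grid point (x_o, y_o, 0).
    """
    c = np.asarray(P_c[:2], dtype=float)   # (x_c, y_c)
    o = np.asarray(P_o[:2], dtype=float)   # (x_o, y_o)
    cos_angle = np.dot(c, o) / (np.linalg.norm(c) * np.linalg.norm(o))
    return np.pi / 2 - np.arccos(np.clip(cos_angle, -1.0, 1.0))
```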
The mapping between the imaginary object plane
P1P2P3P4 and the rectified image A can be written
as
P = Tq = \begin{pmatrix} \cos\theta & 0 & -{}^{A}u_o \cos\theta + x_o/\alpha \\ \sin\theta & 0 & -{}^{A}u_o \sin\theta + y_o/\alpha \\ 0 & -1 & {}^{A}v_o \\ 0 & 0 & 1/\alpha \end{pmatrix} q, \qquad (3)
where qo = ({}^{A}u_o, {}^{A}v_o, 1)^T is the desired projected po-
sition of Po in A. α is the ratio between the real world
object size and the image patch size which the detec-
tor operates on (e.g., for a 1.8-meter-tall human that
is normalized to an image patch of 60 pixels in height,
α = 1.8/60). Image A can therefore be generated from
image I by the homography
p = MP = MT q. (4)
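To make Equations 3 and 4 concrete, here is a hedged sketch that builds T and warps the input image with the resulting 3×3 homography MT. The use of OpenCV's warpPerspective and the argument conventions are assumptions of this sketch, not details given in the paper.

```python
import numpy as np
import cv2  # OpenCV, used here only for the perspective warp

def rectify_subimage(I, M, theta, P_o, q_o, alpha, size):
    """Generate the rectified sub-image A at grid point P_o (Eqs. 3-4).

    M: 3x4 world-to-image projection; theta: orientation from Equation 2;
    P_o = (x_o, y_o, 0): grid point; q_o = (u_o, v_o): desired foot position
    of the object in A; alpha: world-size / patch-size ratio;
    size = (width, height) of A in pixels.
    """
    x_o, y_o, _ = P_o
    u_o, v_o = q_o
    # T maps rectified-image coordinates q = (u, v, 1)^T to world
    # coordinates P, following Equation 3.
    T = np.array([
        [np.cos(theta), 0.0, -u_o * np.cos(theta) + x_o / alpha],
        [np.sin(theta), 0.0, -u_o * np.sin(theta) + y_o / alpha],
        [0.0,          -1.0,  v_o],
        [0.0,           0.0,  1.0 / alpha],
    ])
    H = M @ T  # 3x3 homography from A coordinates to I coordinates (Eq. 4)
    # H maps destination (A) pixels to source (I) pixels, which is the
    # convention warpPerspective expects when WARP_INVERSE_MAP is set.
    return cv2.warpPerspective(I, H, size,
                               flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
```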
Figure 5. Orientation of the imaginary plane of a pedestrian at each grid point (left: viewed from the camera of the scene; right: another view).
Figure 6. Choice of the orientation of the rectified image projection plane A (or the imaginary object plane H) affects the appearance of the rectified sub-image. Sub-images shown here are computed using the orientation calculated in our approach (θ) and other alternatives (offsets of ±30° and ±60°).
Figure 5 is an example of the imaginary object plane
at each search grid point in a given scene from the PETS
2007 data. Figure 6 shows how different choices of θ can result in different rectified images. We can see that
our approach can approximate the appearance of the ob-
ject under an orthogonal view and with a standard size,
which is suitable for input to a sub-window classifier.
3.2 Result interpretation and post processing

If an object is detected in a certain rectified image
A at position qo, we can use Equation 3 to estimate its
3D position Po. Multiple detection responses may be
present around one object, therefore agglomerative clus-
tering is done based on the 3D distances of the detec-
tion responses. Clustering using 3D distance is better
for crowded scenes because in such cases, detection re-
sponses of two objects may largely overlap in 2D, caus-
ing clustering in 2D to merge them.
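A possible realization of this clustering step, as a sketch under our own assumptions (SciPy's agglomerative tools and an illustrative 0.5 m merge distance; the paper does not specify either):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_detections(positions, scores, merge_dist=0.5):
    """Merge nearby detection responses by their 3D distance.

    positions: (N, 3) estimated 3D positions P_o of detection responses;
    scores: (N,) detection confidences. Returns the highest-scoring
    response of each cluster.
    """
    positions = np.asarray(positions, dtype=float)
    scores = np.asarray(scores, dtype=float)
    if len(positions) < 2:
        return positions
    Z = linkage(positions, method='average')  # agglomerative cluster tree
    labels = fcluster(Z, t=merge_dist, criterion='distance')
    keep = [int(np.argmax(np.where(labels == c, scores, -np.inf)))
            for c in np.unique(labels)]
    return positions[keep]
```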
4 Experiment

Experiments are done on subsets of the PETS 2007
and CAVIAR INRIA data^1, with quantitative compari-
son between our approach and detection by conventional
2D search.
4.1 Pedestrian detection in PETS 2007 data

We integrate our approach with a Cluster Boosted
Tree detector trained using the method in [11]. The
^1 The images and ground truth are available at http://iris.usc.edu/~yli8/data/PETS07 subset.zip and CAVIAR INRIA subset.zip.
Figure 7. Sample results on the PETS 2007 dataset (all the false alarms of 3D search are marked in red).
training samples are collected from scenes in which
pedestrians are upright and the camera position is dis-
tant. No modification to the classifier itself is necessary
to integrate it with our approach.
For the PETS 2007 data, we test on the third view
which has an obvious perspective effect that impairs the
detection performance of the conventional 2D search
method. Since ground truth is not available, we ran-
domly select 540 frames (1794 humans) to perform a
quantitative comparison.
Figure 8. Typical missed detections in the PETS 2007 dataset.
The detection rate and false alarm number are com-
puted for the 3D search and the conventional 2D search
using the same detector. Results are given in Table 1,
and Figure 7 shows some sample results. From the re-
sults before clustering we can see that the detection re-
sponse of 3D search is much stronger than that of 2D.
While the 3D search method is capable of detecting hu-
mans with different orientations and perspective distor-
tion, humans correctly detected by the detector using 2D
search are mostly around the upper-left part of the im-
ages, where the perspective effect is not strong. Also,
clustering using 3D distance can give better object infer-
ence in crowded scenes, as we have discussed in Section
3.2.
The overall detection rate is not high mainly due to
occlusion, low contrast and pose variation. Figure 8
shows some failure modes: profile view with walking
pose (marked by yellow circles), top-down view (blue),
occlusion in crowds (white) and poses like crouching
and bending (orange). The latter two are common dif-
ficulties for human detection. For the walking pose, the
rectified sub-image of a walking person is less accurate
in the lower body because the stretched leg is far from
our assumed planar surface of the human. Also, in the
region where the camera is pointing down over the head
(lower right image region in the PETS examples), the de-
tection rate is low because it is hard to recover the frontal
orthogonal view from a near-overhead camera view.
Figure 9. Sample results on the CAVIAR INRIA dataset (all the false alarms of 3D search are marked in red).
4.2 Pedestrian detection in CAVIAR INRIA data
For the CAVIAR INRIA data [1], we calibrate the
camera using a semi-automatic calibration tool by label-
ing parallel lines from the buildings. The result indicates
that our method is quite robust to inaccuracy in camera
parameters.
Quantitative comparison on the CAVIAR INRIA
dataset is done on 706 randomly selected images (947
humans). The result of conventional 2D search is very
poor (“2D” in Table 2). For a more meaningful compar-
ison, we improve the 2D search by rotating the image
in 2D according to the tilt angle of human’s upright di-
rection at each scanning point. The result shows that
the detection rate of 3D search is still more than twice
that of 2D search plus in-plane rotation, with compara-
ble false alarm rates (“2D+rotation” in Table 2), mean-
ing that adjusting the tilt angle in 2D alone is not sufficient.
Some sample results can be found in Figure 9.
                         3D (L)   3D (H)   2D (L)   2D (H)
Detection rate           65.6%    48.4%    46.6%    25.7%
False alarms / frame     0.28     0.07     0.63     0.09

Table 1. Comparison on the PETS 2007 dataset. (H) indicates a higher threshold of detection confidence, (L) a lower threshold.

                         3D       2D+rotation   2D
Detection rate           87.2%    38.3%         17.5%
False alarms / frame     1.23     1.54          1.75

Table 2. Comparison on the CAVIAR INRIA dataset.

5 Conclusion

We have proposed a novel 3D searching strategy for
object detection, which has the following advantages:
image rectification to reduce the adverse effect of cam-
era view on detection performance, integration of scene
knowledge to reduce false alarms, flexibility to com-
bine with any patch-based detector, and the ability to
estimate object position in 3D. The estimated 3D posi-
tions can improve post-detection processes, and
hopefully will also integrate more naturally with existing
multi-view tracking algorithms.
6 Acknowledgments

This research is supported, in part, by the U.S. Gov-
ernment VACE program. Yuan Li is funded, in part, by
a Provost’s Fellowship from USC.
References

[1] CAVIAR dataset. http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/.
[2] PETS 2007 dataset. http://www.cvg.rdg.ac.uk/PETS2007/.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[4] D. Hoiem, A. A. Efros, and M. Hebert. Putting objects in perspective. In CVPR, 2006.
[5] B. Leibe, K. Schindler, and L. Van Gool. Coupled detection and trajectory estimation for multi-object tracking. In ICCV, 2007.
[6] S. Z. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum. Statistical learning of multi-view face detection. In ECCV, 2002.
[7] Z. Lin, L. S. Davis, D. Doermann, and D. DeMenthon. Hierarchical part-template matching for human detection and segmentation. In ICCV, 2007.
[8] P. Sabzmeydani and G. Mori. Detecting pedestrians by learning shapelet features. In CVPR, 2007.
[9] P. Tu, N. Krahnstoever, and J. Rittscher. View adaptive detection and distributed site wide tracking. In AVSS, 2007.
[10] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001.
[11] B. Wu and R. Nevatia. Cluster boosted tree classifier for multi-view, multi-pose object detection. In ICCV, 2007.