
Detecting Nonexistent Pedestrians

Jui-Ting Chien, Chia-Jung Chou, Hwann-Tzong Chen
National Tsing Hua University, Taiwan

{ydnaandy123,jessie33321}@gmail.com, [email protected]

Abstract

We explore beyond object detection and semantic segmentation, and propose to address the problem of estimating the presence probabilities of nonexistent pedestrians in a street scene. Our method builds upon a combination of generative and discriminative procedures to achieve the perceptual capability of figuring out missing visual information. We adopt state-of-the-art inpainting techniques to generate the training data for nonexistent-pedestrian detection. The learned detector can predict the probability of observing a pedestrian at some location in the current image, even if that location exhibits only the background. We evaluate our method by inserting pedestrians into the image according to the presence probabilities and conducting a user study to distinguish real and synthetic images. The empirical results show that our method can capture the idea of where the reasonable places are for pedestrians to walk or stand in a street scene.

1. Introduction

Humans are good at inferring implicit information from images based on the context. For instance, with sufficient familiarity with urban street scenes, humans know where to look to find pedestrians. This implies that peripheral cues in the scene, other than the pedestrians themselves, are useful for pedestrian detection. The ability to explore and exploit those cues would be important for algorithms to achieve human-level scene understanding and behavior analysis.

Our goal is to address the problem of predicting where pedestrians are likely to appear in a scene. Such a task is new and different from the conventional setting of pedestrian detection in that the target locations might not contain any pedestrians. Real and synthetic datasets of urban scenes [2, 3, 8, 9] are popular and useful for training deep models for object detection and semantic segmentation, but they cannot be directly used for our task. We therefore generate new training data that force our method to learn how to infer possible presence locations of pedestrians from implicit contextual cues rather than from features on the pedestrians themselves.

Figure 1. Top: The input image and the predicted heatmap. The likelihoods of head and foot positions are depicted in red and blue. Bottom: The synthesized image with phantom pedestrians.

2. Method

The proposed pipeline has three parts: i) generating training data for nonexistent-pedestrian detection, ii) learning to predict possible locations of pedestrians, and iii) synthesizing new images with phantom pedestrians for qualitative evaluation. See Fig. 1 for an illustration of the pipeline.

Training data. From [2] we select a wide range of urban-scene images that contain at least one pedestrian. Each image has a ground-truth segmentation of pedestrians. We use state-of-the-art inpainting methods [6, 11] to remove all pedestrians and create a set of ‘background’ images. Based on the synthetic backgrounds, the original images, and the ground-truth segmentations, we generate various training data exhibiting different combinations of removed/non-removed pedestrians and the corresponding heatmaps of possible pedestrian presence, as in Fig. 2.
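As a concrete illustration, a minimal Python sketch of this data-generation step is given below. It is our reconstruction, not the authors' code: classical OpenCV inpainting stands in for the neural inpainting methods [6, 11], head and foot keypoints are approximated from the top and bottom of each pedestrian mask, and only two of the three heatmaps used by the method are rendered.

# Minimal sketch of generating one training pair (inpainted background plus
# heatmap targets) from an image and its pedestrian masks. All parameter
# choices (sigma, dilation size, inpainting radius) are illustrative.
import cv2
import numpy as np

def gaussian_heatmap(shape, center, sigma):
    # Render a 2D Gaussian bump centered at (x, y) into a map of the given shape.
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def make_training_pair(image, pedestrian_masks, sigma=8.0):
    # image: HxWx3 uint8; pedestrian_masks: list of HxW boolean masks.
    # Returns (inpainted background, head heatmap, foot heatmap).
    h, w = image.shape[:2]
    union = np.zeros((h, w), np.uint8)
    head_map = np.zeros((h, w), np.float32)
    foot_map = np.zeros((h, w), np.float32)
    for m in pedestrian_masks:
        union |= m.astype(np.uint8)
        ys, xs = np.nonzero(m)
        # Crude keypoints: top-center of the mask as head, bottom-center as foot.
        head = (int(xs[ys == ys.min()].mean()), int(ys.min()))
        foot = (int(xs[ys == ys.max()].mean()), int(ys.max()))
        head_map = np.maximum(head_map, gaussian_heatmap((h, w), head, sigma))
        foot_map = np.maximum(foot_map, gaussian_heatmap((h, w), foot, sigma))
    # Dilate the mask slightly so the inpainting also covers shadows and borders.
    union = cv2.dilate(union, np.ones((7, 7), np.uint8))
    background = cv2.inpaint(image, union * 255, 5, cv2.INPAINT_TELEA)
    return background, head_map, foot_map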

Learning. We adopt the methods of [5, 10] for learning to detect nonexistent pedestrians. We also propose a new model that combines an FCN [10] with a discriminator. The idea of including a discriminator is inspired by generative adversarial networks [1, 4, 7]. Building on [10], we seek to generate a distribution of pedestrian locations that is not only coherent with the input context but also realistic enough to deceive the discriminator.
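The following PyTorch sketch shows, schematically, how a pixel-wise heatmap loss can be combined with an adversarial term from a discriminator that judges image/heatmap pairs. The layer configuration, loss functions, and weighting are our assumptions for illustration; the paper does not specify these details.

# Schematic sketch of the FCN-plus-discriminator idea (our assumption).
# The heatmap predictor plays the generator role; the discriminator judges
# (image, heatmap) pairs, as in conditional adversarial training [1, 4, 7].
import torch
import torch.nn as nn

class HeatmapFCN(nn.Module):          # small stand-in for the FCN of [10]
    def __init__(self, n_maps=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, n_maps, 1), nn.Sigmoid())
    def forward(self, x):
        return self.net(x)

class Discriminator(nn.Module):       # judges image + heatmap pairs
    def __init__(self, n_maps=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + n_maps, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=2, padding=1))
    def forward(self, img, heat):
        return self.net(torch.cat([img, heat], dim=1))

def generator_loss(fcn, disc, img, target_heat, lam=0.01):
    # Pixel-wise heatmap loss plus an adversarial term that rewards predicted
    # heatmaps the discriminator cannot tell apart from ground-truth ones.
    pred = fcn(img)
    recon = nn.functional.binary_cross_entropy(pred, target_heat)
    logits = disc(img, pred)
    adv = nn.functional.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))
    return recon + lam * adv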


Figure 2. Generating training data. Left: The original image. Middle two: Some pedestrians are removed from the original image and inpainted with background. Right: The corresponding heatmap over the inpainted image as the ground truth for training.

Figure 3. Experimental results: The predicted locations of nonexistent pedestrians, visualized as heatmaps.

Figure 4. More examples of the synthesis pipeline. We adjust the image brightness for better visualization. From top to bottom: input images, predicted heatmaps, and synthesized images with phantom pedestrians according to the predicted heatmaps.

Synthesis. Our method automatically synthesizes a new test image according to the predicted nonexistent-pedestrian heatmaps. More specifically, for each test image, we obtain three heatmaps modeling the head and feet locations of pedestrians. We perform non-maximum suppression on the heatmaps and propose several candidate pedestrian poses. We search for the most likely pedestrian poses in the training dataset, and render the output image by copy-and-paste. More examples of the synthesis pipeline are shown in Fig. 4.
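A minimal sketch of the non-maximum-suppression step is shown below, assuming a standard peak-picking scheme on a predicted heatmap; the window size and threshold are illustrative values, not taken from the paper.

# Sketch of heatmap non-maximum suppression: keep local maxima above a
# threshold and return them sorted by score.
import numpy as np
from scipy.ndimage import maximum_filter

def heatmap_peaks(heatmap, window=15, threshold=0.3):
    # Return (x, y, score) tuples for local maxima above `threshold`.
    local_max = maximum_filter(heatmap, size=window) == heatmap
    ys, xs = np.nonzero(local_max & (heatmap > threshold))
    peaks = [(int(x), int(y), float(heatmap[y, x])) for x, y in zip(xs, ys)]
    return sorted(peaks, key=lambda p: -p[2])

# Candidate poses could then pair each foot peak with a head peak above it,
# giving an approximate position and scale for a pedestrian to paste in.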

3. Experimental Results and Conclusion

Given unseen images, the trained model performs well at mimicking human perceptual ability. In Fig. 3, we highlight some logical patterns that can be observed in the generated images: i) Sidewalks, safety islands, and bus stops are often assigned high probabilities of pedestrian presence, even if the scene is void of pedestrians. ii) The timing is right: the ‘phantom’ pedestrians are inclined to cross the street when there is no car. iii) People tend to form groups. iv) Depth and perspective are correct: the ‘sizes’ of high-response areas in the heatmap are in accordance with the depth and vanishing point.

To verify the quality of the synthesized images, we conduct a user study and ask viewers to distinguish real and synthesized images. Our method significantly outperforms the baseline, which randomly selects a location and a scale at which to add a pedestrian into the scene.

This work shows the possibility of using deep networks to infer nonexistent objects from the implicit cues in a scene, a relatively unexplored area that has no standard mechanisms for quantitative evaluation. The proposed idea suggests an alternative direction toward scene understanding and may be further applied to other tasks.


References

[1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.

[2] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.

[3] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. I. J. Robotics Res., 32(11):1231–1237, 2013.

[4] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. CoRR, abs/1704.00028, 2017.

[5] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, pages 483–499, 2016.

[6] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.

[7] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ArXiv e-prints, Nov. 2015.

[8] S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: Ground truth from computer games. In ECCV, pages 102–118, 2016.

[9] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, pages 3234–3243, 2016.

[10] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2017.

[11] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. ArXiv e-prints, Nov. 2016.

