An Efficient Convolutional Network for Human Pose Estimation

Umer Rafi 1

rafi@vision.rwth-aachen.de

Ilya Kostrikov 1

[email protected]

Juergen Gall 2

[email protected]

Bastian Leibe 1

[email protected]

1 Computer Vision Group, RWTH Aachen University, Germany

2 Computer Vision Group, University of Bonn, Germany

In recent years, human pose estimation has greatly benefited from deep learning, and huge gains in performance have been achieved on popular benchmarks [1, 3, 4]. The trend to maximise accuracy on benchmarks, however, has resulted in computationally expensive deep network architectures that require expensive hardware and pre-training on large datasets. In this work, we propose an efficient deep network architecture that can be trained efficiently on mid-range GPUs without any pre-training and that is on par with much more complex models on the benchmarks [1, 3, 4].

Our proposed Fully Convolutional GoogLeNet (FCGN) network (see Figure 1) is based on the network architecture from [2]. We take the first 17 layers of [2] and add a deconvolution layer to make it fully convolutional. In addition, we introduce a skip layer and combine two FCGNs with shared weights to obtain a multi-resolution network. Belief maps for each joint are then obtained by a deconvolution layer with a large kernel size, in combination with a sigmoid function for normalisation and spatial dropout for regularisation.
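The belief-map step described above can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the kernel size, stride, and padding are hypothetical values (the abstract only says the kernel is large), and reading a joint position off the belief-map maximum is an assumed decoding step. The size formula is the standard one for a transposed convolution without dilation or output padding.

```python
import math


def deconv_output_size(in_size, kernel, stride, padding=0):
    """Output length of a transposed convolution along one axis
    (no dilation, no output padding)."""
    return (in_size - 1) * stride - 2 * padding + kernel


def sigmoid(x):
    """Numerically stable logistic function, mapping any score into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def belief_map(scores):
    """Apply the sigmoid to every pixel of a 2D score map independently."""
    return [[sigmoid(v) for v in row] for row in scores]


def joint_location(belief):
    """Return ((row, col), value) of the highest-belief pixel.
    Assumed decoding step; ties resolve to the first maximum found."""
    best_value, best_pos = -1.0, (0, 0)
    for r, row in enumerate(belief):
        for c, value in enumerate(row):
            if value > best_value:
                best_value, best_pos = value, (r, c)
    return best_pos, best_value


# Hypothetical layer settings: an 8-pixel-wide feature map upsampled by a
# transposed convolution with kernel 32, stride 16, padding 8.
upsampled = deconv_output_size(in_size=8, kernel=32, stride=16, padding=8)

# Toy raw scores for a single joint (made-up numbers).
raw_scores = [
    [-4.0, -2.0, -3.0],
    [-1.0, 3.0, 0.5],
    [-5.0, 0.0, -2.5],
]
position, confidence = joint_location(belief_map(raw_scores))
print(upsampled, position, round(confidence, 3))
```

Because the sigmoid is applied per pixel, each joint's map is normalised independently of the others, unlike a softmax over all pixels, which would force the values in one map to sum to one.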

We compare the performance of the proposed architecture against convolutional pose machines [5] on the well-known FLIC, LSP, and MPII benchmarks [1, 3, 4]. Our proposed network outperforms most previous approaches and achieves performance competitive with the more complex model of [5], while requiring only 3 GB of memory and far less training time.
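The abstract does not say how accuracy is scored. Pose benchmarks of this kind are commonly evaluated with PCK-style metrics (percentage of correct keypoints), where a predicted joint counts as correct if it lies within a fraction of a reference length of the ground truth. A minimal sketch follows; the function name, the threshold, and the reference length are assumptions supplied by the caller, not values from the paper.

```python
import math


def pck(predicted, ground_truth, reference_length, alpha=0.5):
    """Fraction of predicted joints within alpha * reference_length
    of their ground-truth position. Joints are (x, y) pairs."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")
    if not predicted:
        return 0.0
    limit = alpha * reference_length
    correct = 0
    for (px, py), (gx, gy) in zip(predicted, ground_truth):
        if math.hypot(px - gx, py - gy) <= limit:
            correct += 1
    return correct / len(predicted)


# Made-up coordinates for three joints; reference length of 10 pixels.
predicted = [(10, 10), (20, 22), (40, 40)]
ground_truth = [(10, 12), (20, 20), (30, 30)]
score = pck(predicted, ground_truth, reference_length=10, alpha=0.5)
print(score)
```

Here the first two joints are 2 pixels off and count as correct under the 5-pixel limit, while the third is about 14 pixels off and does not.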

Figure 1: (a) Proposed fully convolutional GoogLeNet (FCGN). (b) The proposed multi-resolution network combines two FCGNs.

Figure 2: Qualitative results on FLIC [4], LSP [3], and MPII [1].

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D Human Pose Estimation: New Benchmark. In CVPR, 2014.

[2] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.

[3] S. Johnson and M. Everingham. Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation. In BMVC, 2010.

[4] B. Sapp and B. Taskar. MODEC: Multimodal Decomposable Models for Human Pose Estimation. In CVPR, 2013.

[5] S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional Pose Machines. In CVPR, 2016.
