3D Head Pose Estimation Using the Kinect

Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Samuel Sonning, Sabina Sonning

Department of Applied Physics and Electronics, Umeå University, Umeå, Sweden

{farid.kondori, shahrouz.yousefi, haibo.li, samuel.sonning, sabina.sonning}@tfe.umu.se

Abstract—Head pose estimation plays an essential role in bridging the information gap between humans and computers. Conventional head pose estimation methods mostly operate on images captured by cameras; however, accurate and robust pose estimation is often problematic. In this paper we present an algorithm for recovering the six degrees of freedom (DOF) of head motion from a sequence of range images taken by the Microsoft Kinect for Xbox 360. The proposed algorithm uses a least-squares minimization of the difference between the measured rate of change of depth at a point and the rate predicted by the depth rate constraint equation. We segment the human head from its surroundings and background, and then estimate the head motion. Our system can recover the six DOF of head motion for multiple people in one image. The proposed system is evaluated in our lab and shows superior results.

I. INTRODUCTION

3D head pose estimation is a challenging problem in the field of human-computer interaction due to pose variations, illumination changes, and complex backgrounds. Conventional head pose estimation methods take images captured by cameras as input. Appearance template methods use image-based comparison metrics to match a new image of a head against a set of exemplars with corresponding pose labels in order to find the most similar view [1], [2]. Some methods use the locations of features such as the eyes, mouth, and nose tip to determine the head pose from their relative configuration [3], [4]. Tracking methods estimate the global pose change of the head from the observed movement between video frames [5], [6]. These methods extract keypoints in the image, such as scale-invariant feature transform (SIFT) features [7], to recover the movement from frame to frame. However, these methods suffer from illumination changes. Range images, on the contrary, are well known to be robust against illumination variations in the environment and can be considered a solution. In addition, multiple-user tracking, where the heads of different people overlap with each other or are occluded by other objects, is still an issue [8], [9], [10].
RGB image based approaches encounter difficulties in retrieving the head. Employing depth information substantially enhances head retrieval, since individual heads can be discriminated from each other by their corresponding depths. In the past few years, researchers have focused on using time-of-flight (TOF) range cameras and have proposed different algorithms to address the problem of pose estimation and human motion capture from range images [11], [12], [13], [14].

Fig. 1. The Microsoft Kinect: (A) laser projector, (B) RGB camera, (C) monochrome CMOS camera.

Although these methods have acceptable performance, they are limited in the sense that not all six DOF of head motion can be recovered. Considering the drawbacks of previous implementations, in this paper we present a novel approach for recovering the six DOF of head motion. Our system detects and localizes the head using depth information obtained by the Kinect. First the background is subtracted, and then the user's head is detected by the system. Finally, the head's depth information is used to estimate the head pose. The proposed approach is much less complex than previous ones and can handle multi-person situations. Moreover, our system is robust to illumination changes and to the object overlapping problem, both of which can degrade system performance. The method is evaluated on a test dataset captured in our lab using the Kinect and achieves excellent results.
The paper is organized as follows: Section 2 gives an overview of the Microsoft Kinect. The system description is presented in Section 3. Section 4 describes the head detection and segmentation algorithm. Our 3D head pose estimation method is then discussed in Section 5. Results are given in Section 6, and finally we present our conclusions in Section 7.

II. THE MICROSOFT KINECT

The Microsoft Kinect is a peripheral device for the Xbox 360. It obtains depth estimates using a structured light pattern. The device consists of a multi-array microphone, an RGB camera, a monochrome CMOS camera, and an infrared laser projector (Fig. 1). The laser projector casts a structured light pattern onto the scene, which is imaged by the CMOS camera. The displacement of the CMOS camera relative to the laser projector makes it possible to compute the distance to objects in the scene by triangulation. The device outputs RGB and range images with 640 × 480 pixels at 30 frames per second.
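As a back-of-the-envelope illustration of this triangulation principle (not the Kinect's proprietary calibration), depth is inversely proportional to the disparity of a projected pattern dot; the baseline and focal length below are assumed example values.

```python
# Illustrative structured-light triangulation: depth is inversely
# proportional to disparity. The baseline and focal length are assumed
# example values, not the Kinect's actual calibration constants.
def depth_from_disparity(disparity_px, baseline_m=0.075, focal_px=580.0):
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return baseline_m * focal_px / disparity_px  # depth in meters

# Example: a pattern dot displaced by 29 px maps to roughly 1.5 m.
print(round(depth_from_disparity(29.0), 2))
```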


Fig. 2. System overview.

Microsoft has released a non-commercial Kinect software development kit (SDK) [15] for Windows. It provides Kinect capabilities to developers who build applications with C++, C#, or Visual Basic using Microsoft Visual Studio 2010. In addition, open source drivers in the form of the libfreenect library [16] are available and can be used to interface with the device. Approximate formulae for converting the Kinect depth map to metric distances are also available [17].
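For readers who want to experiment, the minimal sketch below grabs one raw depth frame through the Python bindings distributed with libfreenect [16]. It assumes the freenect wrapper is installed and is not part of the paper's implementation; the raw values are 11-bit disparities, not metric distances (see [17]).

```python
# Minimal sketch: grab a single raw depth frame from the Kinect through the
# libfreenect Python bindings [16]. Assumes the `freenect` wrapper is installed.
import freenect
import numpy as np

def grab_raw_depth():
    # sync_get_depth() returns (depth, timestamp); depth is a 480x640 array
    # of raw 11-bit disparity values.
    depth, _ = freenect.sync_get_depth()
    return np.asarray(depth, dtype=np.uint16)

if __name__ == "__main__":
    raw = grab_raw_depth()
    print(raw.shape, int(raw.min()), int(raw.max()))
```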

III. SYSTEM DESCRIPTION

This section presents an overview of the main steps of the proposed approach, which is illustrated in Fig. 2. Implementation details are presented in Sections 3 through 5. Given an input depth array, we first reduce noise and smooth the array for further processing. Then a three-stage head detection process is used to locate the user's head. Since we are only interested in moving objects, we perform background subtraction to isolate the foreground. Afterwards, in order to find distinct objects, the foreground is passed through our segmentation algorithm. Finally, we discard irrelevant candidate segments and locate the user's head. Once the head is located in one frame, our system keeps track of it in subsequent frames, so the head does not need to be identified again. Once we have segmented the head in two consecutive frames, we recover the six DOF of the head motion using our algorithm. The head motion parameters can then be used in different applications, such as human-computer interaction.

IV. MULTIPLE HEAD DETECTION AND TRACKING

The head detection algorithm is composed of several steps, as shown in Fig. 3. After smoothing the depth array and reducing noise, the raw depth values are converted to metric values between 0.6 and 6 meters according to the formula given by Stéphane Magnenat [17]. In the next step the background is subtracted from the depth array. We assume that prior knowledge about the background is available; this can be considered an initialization step, in which we extract the background depth data. A difference matrix is computed by comparing the original depth array with the background. If the difference at a pixel is below a threshold, the pixel is set to zero; otherwise it is retained. The result is a matrix containing the depth information of the foreground.
Segmentation is then performed with a depth-first algorithm. A pixel is in the same segment as its neighbor if the depth difference between them is less than a threshold (0.1-0.2 m). Any segment that contains fewer pixels than a given number is considered non-human and discarded. Given a segment, the system also needs to locate the head.
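The following sketch shows how this preprocessing chain might be implemented, using the depth-to-metric approximation attributed to Stéphane Magnenat on the OpenKinect wiki [17]; the background difference threshold and minimum segment size are illustrative assumptions, not the paper's exact parameters.

```python
# Sketch of raw-to-metric conversion, background subtraction, and depth-first
# segmentation as described above. Thresholds are illustrative assumptions.
import numpy as np

def raw_to_meters(raw):
    # Approximation attributed to Stephane Magnenat on the OpenKinect
    # wiki [17]; useful roughly over the 0.6-6 m working range.
    return 0.1236 * np.tan(raw / 2842.5 + 1.1863)

def foreground_depth(depth_m, background_m, diff_thresh=0.05):
    # Pixels whose depth is close to the stored background are zeroed;
    # the rest form the foreground depth matrix.
    fg = depth_m.copy()
    fg[np.abs(depth_m - background_m) < diff_thresh] = 0.0
    return fg

def segment_foreground(fg, depth_thresh=0.15, min_pixels=500):
    # Depth-first flood fill: a pixel joins its neighbor's segment if their
    # depth difference is below depth_thresh (the paper uses 0.1-0.2 m).
    h, w = fg.shape
    labels = np.zeros((h, w), dtype=np.int32)
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if fg[sy, sx] == 0.0 or labels[sy, sx] != 0:
                continue
            next_label += 1
            stack = [(sy, sx)]
            labels[sy, sx] = next_label
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and fg[ny, nx] != 0.0 and labels[ny, nx] == 0
                            and abs(fg[ny, nx] - fg[y, x]) < depth_thresh):
                        labels[ny, nx] = next_label
                        stack.append((ny, nx))
    # Segments with too few pixels are considered non-human and discarded.
    for lab in range(1, next_label + 1):
        if np.count_nonzero(labels == lab) < min_pixels:
            labels[labels == lab] = 0
    return labels
```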

Fig. 3. Head detection scheme.

The head is located by finding the topmost pixel of the segment, estimating the height of the head, and finding the leftmost and rightmost pixels within a certain area belonging to the segment. These four pixels constitute the boundaries of the rectangle containing the head (Fig. 4).
In order to track the head between frames, the mean x, y, and depth values of each segment in one frame are stored and compared with those in the next frame. If a similar segment is found between frames, the two are regarded as the same segment.
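A possible realization of the localization and frame-to-frame association steps is sketched below; the head height in pixels and the matching tolerances are assumed values, not the paper's parameters.

```python
# Sketch of head localization and tracking. head_height_px and the matching
# tolerances are illustrative assumptions.
import numpy as np

def locate_head(segment_mask, head_height_px=60):
    # Topmost pixel of the segment, then leftmost/rightmost segment pixels
    # within the assumed head band: the four bounds of the head rectangle.
    ys, xs = np.nonzero(segment_mask)
    if ys.size == 0:
        return None
    top = int(ys.min())
    band = ys < top + head_height_px
    left, right = int(xs[band].min()), int(xs[band].max())
    return top, top + head_height_px, left, right

def segment_stats(segment_mask, depth_m):
    # Mean x, mean y, and mean depth of a segment, stored per frame.
    ys, xs = np.nonzero(segment_mask)
    return xs.mean(), ys.mean(), depth_m[ys, xs].mean()

def same_segment(stats_a, stats_b, tol=(20.0, 20.0, 0.3)):
    # Segments whose mean x, mean y, and mean depth all lie within the
    # tolerances across consecutive frames are regarded as the same head.
    return all(abs(a - b) < t for a, b, t in zip(stats_a, stats_b, tol))
```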

V. 3D HEAD POSE ESTIMATION

The time-varying depth map from the Kinect can be viewed as a function of the form Z(X, Y, t). Taking the full time derivative of Z via the chain rule, the following equation is obtained:

$$\frac{dZ}{dt} = \frac{\partial Z}{\partial X}\frac{dX}{dt} + \frac{\partial Z}{\partial Y}\frac{dY}{dt} + \frac{\partial Z}{\partial t} \qquad (1)$$

This can be written in the form

$$\dot{Z} = p\dot{X} + q\dot{Y} + Z_t$$

The above equation will be called the depth rate constraint equation, where the three partial derivatives of Z are denoted by

$$p = \frac{\partial Z}{\partial X}, \quad q = \frac{\partial Z}{\partial Y}, \quad \text{and} \quad Z_t = \frac{\partial Z}{\partial t}$$

Fig. 4. Head localization.

and the components of the velocity of a point in the depth image are given by

$$\dot{X} = \frac{dX}{dt}, \quad \dot{Y} = \frac{dY}{dt}, \quad \text{and} \quad \dot{Z} = \frac{dZ}{dt}$$

The values of the partial derivatives $p$, $q$, and $Z_t$ can be estimated at each pixel in the depth map, while $\dot{X}$, $\dot{Y}$, and $\dot{Z}$ are unknown. There is one such equation for every point in the segmented depth map corresponding to the head, so if it contains $n$ points, there are $n$ equations in a total of $3n$ unknowns. This system of equations is extremely underconstrained, and additional assumptions are necessary to obtain a unique solution. In the discussion above, no constraint was placed on the motion of neighboring points; each point was able to move completely independently. In most real motions, however, neighboring points within the head do have similar velocities. Horn and Harris [18] have shown that there is a way to increase the amount of constraint. In analogy with the direct method for recovering motion from an ordinary image sequence [19], we assume that the head is rigid and recover its motion relative to the sensor. In this case there are only six degrees of freedom of motion to recover, so the corresponding system of equations is vastly overconstrained.

Let $\mathbf{R} = (X, Y, Z)^T$ be the vector to a point on the head. If the head moves with instantaneous translational velocity $\mathbf{t}$ and instantaneous rotational velocity $\boldsymbol{\omega}$ with respect to the sensor, then the point $\mathbf{R}$ appears to move with velocity

$$\frac{d\mathbf{R}}{dt} = -\mathbf{t} - \boldsymbol{\omega} \times \mathbf{R} \qquad (2)$$

with respect to the sensor [20]. The components of the velocity vectors are given by

$$\mathbf{t} = \begin{pmatrix} U \\ V \\ W \end{pmatrix} \quad \text{and} \quad \boldsymbol{\omega} = \begin{pmatrix} A \\ B \\ C \end{pmatrix}$$

Rewriting the equation for the rate of change of $\mathbf{R}$ in component form yields

$$\begin{aligned} \dot{X} &= -U - BZ + CY &\quad (3) \\ \dot{Y} &= -V - CX + AZ &\quad (4) \\ \dot{Z} &= -W - AY + BX &\quad (5) \end{aligned}$$

where the dots denote differentiation with respect to time. Substituting these component equations into the depth rate constraint equation and collecting the coefficients of the six unknowns yields

$$pU + qV - W + rA + sB + tC = Z_t \qquad (6)$$

where

$$r = -Y - qZ, \quad s = X + pZ, \quad \text{and} \quad t = qX - pY$$

If there are $n$ pixels in the head area, the resulting $n$ equations can be written in matrix form as

$$\underbrace{\begin{pmatrix}
p_1 & q_1 & -1 & r_1 & s_1 & t_1 \\
p_2 & q_2 & -1 & r_2 & s_2 & t_2 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
p_n & q_n & -1 & r_n & s_n & t_n
\end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} U \\ V \\ W \\ A \\ B \\ C \end{pmatrix}}_{x}
=
\underbrace{\begin{pmatrix} (Z_t)_1 \\ (Z_t)_2 \\ \vdots \\ (Z_t)_n \end{pmatrix}}_{b}
\qquad (7)$$

or $Ax = b$. The pixels are numbered from 1 to $n$, as denoted by the subscripts. The above matrix equation corresponds to $n$ linear equations in only six unknowns (namely $U$, $V$, $W$, $A$, $B$, and $C$). Rather than arbitrarily choosing six of the equations and solving the resulting system, a least-squares error minimization technique is employed. The least-squares solution that minimizes the norm $\|Ax - b\|^2$ satisfies the equation

$$A^T A x = A^T b \qquad (8)$$

Consequently, by solving this matrix equation for the vector $x$, the six DOF of the head motion are recovered.
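The sketch below illustrates how Eqs. (6)-(8) could be realized: $p$, $q$, and $Z_t$ are estimated by finite differences over two consecutive metric depth maps of the head, the $n \times 6$ system is assembled, and the least-squares solution is computed. The focal length and principal point are assumed Kinect-like values, not calibrated parameters.

```python
# Sketch of the least-squares pose recovery (Eqs. 6-8). The camera intrinsics
# are assumed Kinect-like values, not calibrated parameters. Head pixels are
# assumed to carry valid (nonzero) depth.
import numpy as np

FX = FY = 580.0          # assumed focal length of the depth camera (pixels)
CX, CY = 320.0, 240.0    # assumed principal point

def recover_head_motion(z0, z1, head_mask, dt=1.0 / 30.0):
    """Recover x = (U, V, W, A, B, C) from two consecutive metric depth maps."""
    v, u = np.nonzero(head_mask)
    z = z0[v, u]
    # Back-project pixels to metric X, Y with the pinhole model.
    X = (u - CX) * z / FX
    Y = (v - CY) * z / FY
    # Spatial depth gradients per pixel; one pixel spans roughly Z/f meters
    # on the object, so dividing by Z/f converts to metric derivatives.
    gy, gx = np.gradient(z0)
    p = gx[v, u] * FX / z            # dZ/dX
    q = gy[v, u] * FY / z            # dZ/dY
    # Temporal difference at a fixed pixel approximates Z_t.
    zt = (z1[v, u] - z) / dt
    # Coefficients of the rotational unknowns from Eq. (6).
    r = -Y - q * z
    s = X + p * z
    t = q * X - p * Y
    A = np.column_stack([p, q, -np.ones_like(p), r, s, t])
    # Least-squares solution of Ax = b, equivalent to solving Eq. (8).
    x, *_ = np.linalg.lstsq(A, zt, rcond=None)
    return x  # [U, V, W, A, B, C]: translation and rotation rates
```

Using `np.linalg.lstsq` rather than forming $A^T A$ explicitly gives the same minimizer as Eq. (8) with better numerical conditioning.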

VI. IMPLEMENTATION

The proposed 3D head pose estimation method has been implemented and tested in our lab on a quad-core Intel i7 at 3.4 GHz. Our system is fast enough for real-time applications. The computational time is about 15-25 ms for the head detection block and 10-15 ms for the 3D pose estimation method, so the total processing time is about 25-40 ms, which corresponds to more than 25 frames per second. The head detection algorithm was tested on a set of 200 range images, and the correct detection rate is almost 96%. We also tried OpenCV's Haar feature based face detection [21], using the Kinect's RGB camera, to facilitate the head detection process, but we faced two major problems. First, this method is computationally expensive, taking about 45-70 ms to detect a human face. Second, it turned out that the user is limited to small head rotations, since the face must stay in front of the camera to achieve an acceptable detection rate. In other words, if the user rotates his or her head more than a particular angle, which is natural in most real applications, the Haar feature based face detector fails to locate the face.
Because no ground truth is available, an experiment was designed to evaluate the system performance. In this experiment the user's head is detected and located in the range images, and the six DOF of the head motion are then recovered and used to manipulate a 3D model on the computer screen. As shown in Fig. 5, the position and orientation of the cubes are updated whenever the user moves his head. Our experiments revealed that the effective distance from the sensor ranges from 0.6 up to 6 meters.

Fig. 5. Experimental results. The 3D head motion parameters are estimated to update the position and orientation of the 3D model. The first row is the initial position. The next three rows show rotation in roll, pitch, and yaw, respectively. The last three rows illustrate translation along the x, y, and z axes.

VII. CONCLUSION

In this paper we propose a real-time 3D head pose estimation method that uses the depth images obtained from the Kinect for Xbox 360. Our experimental results reveal that our approach can effectively detect the human head in all poses and appearances from the depth array, and that it provides an accurate estimate of the six DOF of the head motion. In addition, our method is able to estimate the 3D head pose of multiple people in one image. Moreover, our algorithm is identity and lighting invariant: it works across all identities under dynamic lighting conditions, and it applies to near-field and far-field images at both high and low resolution. Although the experiments show the robustness and efficiency of the system, our method depends strongly on accurate head detection, which implies that an unusually shaped hat or an unusual hair style will probably degrade the head detection performance. Since the Kinect is equipped with an RGB camera, the integration of RGB images with the depth arrays is planned as future work to improve the head detection algorithm.

REFERENCES

[1] J. Ng and S. Gong, Composite support vector machines for detection of faces across views and pose estimation, Image and Vision Computing, vol. 20, no. 5-6, pp. 359-368, 2002.

[2] J. Ng and S. Gong, Multi-view face detection and pose estimation using a composite support vector machine across the view sphere, in Proc. IEEE Int'l Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, 1999, pp. 14-21.

[3] J.-G. Wang and E. Sung, EM enhancement of 3D head pose estimated by point at infinity, Image and Vision Computing, vol. 25, no. 12, pp. 1864-1874, 2007.

[4] H. Wilson, F. Wilkinson, L. Lin, and M. Castillo, Perception of head orientation, Vision Research, vol. 40, no. 5, pp. 459-472, 2000.

[5] S. Ohayon and E. Rivlin, Robust 3D head tracking using camera pose estimation, in Proc. IEEE Int'l Conf. Pattern Recognition, 2006, pp. 1063-1066.

[6] G. Zhao, L. Chen, J. Song, and G. Chen, Large head movement tracking using SIFT-based registration, in Proc. Int'l Conf. Multimedia, 2007, pp. 807-810.

[7] D. Lowe, Distinctive image features from scale-invariant keypoints, Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[8] P. Perez, C. Hue, J. Vermaak, and M. Gangnet, Color-based probabilistic tracking, in Proc. European Conference on Computer Vision, vol. 1, 2002, pp. 661-675.

[9] N. T. Siebel and S. Maybank, Fusion of multiple tracking algorithms for robust people tracking, in Proc. European Conference on Computer Vision, 2002, pp. 373-387.

[10] T. Zhao and R. Nevatia, Tracking multiple humans in complex situations, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1208-1221, 2004.

[11] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun, Real time motion capture using a single time-of-flight camera, in Proc. CVPR, 2010, pp. 755-762.

[12] H. P. Jain and A. Subramanian, Real-time upper-body human pose estimation using a depth camera, HP Technical Reports, HPL-2010-190, 2010.

[13] J. Rodgers, D. Anguelov, H.-C. Pang, and D. Koller, Object pose detection in range scan data, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2006.

[14] Y. Zhu, B. Dariush, and K. Fujimura, Controlled human pose estimation from depth image streams, in Proc. CVPR Workshop on TOF Computer Vision, June 2008.

[15] http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk/default.aspx
[16] https://github.com/OpenKinect/libfreenect
[17] http://openkinect.org/wiki/Imaging_Information
[18] B. K. P. Horn and J. G. Harris, Rigid body motion from range image sequences, 1991.
[19] B. K. P. Horn and E. J. Weldon, Direct methods for recovering motion, Int'l J. Computer Vision, vol. 2, pp. 51-76, 1988.
[20] B. K. P. Horn, Robot Vision, MIT Press, Cambridge, MA, and McGraw-Hill, New York, 1986.
[21] http://opencv.willowgarage.com/wiki/

