Visual Tracking of Human Body with Deforming Motion and Shape Average

Alessandro Bissacco
UCLA Computer Science
Los Angeles, CA 90095
[email protected]

UCLA CSD-TR # 020046

Abstract

In this work we present a novel approach for tracking the human body in video sequences. We model the human skeleton as a kinematic chain of body parts which undergo a transformation composed of rigid motion and shape variation. Tracking is formulated as the problem of minimizing a cost functional with respect to the unknown position and shape of the body parts.

1 Introduction

The problem of tracking humans in video streams has received great attention in recent years. The recent availability of commercial hardware for the capture, transmission, and processing of full-resolution video data has opened a wide range of new applications in this domain.

Image-based human tracking might play a prominent role in the next generation of surveillance systems and human-computer interfaces. Systems for measuring human motion from video data can also provide valuable support for various fields, ranging from kinesiology and rehabilitation in biomechanics to technical training in sports and the performing arts.

Estimating the pose of the human body in a video stream is a difficult problem because of the significant variations in the appearance of the object throughout the sequence. Illumination, viewing conditions, relative position and orientation, and self-occlusions all contribute to making the task of matching human body parts between image frames remarkably difficult.

In this work we propose an approach to visual tracking that models the human skeleton as a kinematic chain of rigid links. We formulate the problem as the minimization of a cost functional with respect to rigid motion and shape of the body parts.

2 Related work

Different approaches have been proposed for the problem of visual tracking of human motion (see [1, 9] for a survey). They can be classified into two main types, depending on whether a priori models of the shape of the human body are used.

Approaches that do not use shape models, such as [21] and [14], usually rely on heuristic procedures to find correspondences of body parts between frames of video sequences. In [18] a variational approach exploiting motion information is used for the detection and tracking of arbitrary objects.

Model-based approaches can be divided into single-view [12, 22, 25] and multi-view [7, 26, 8], and into 2D motion models [15] and 3D models [20, 5]. Most of these approaches require manual initialization in the first frame.

Soatto et al. [23] propose a framework for modeling the motion of deforming objects. A nonrigid transformation is seen as the composition of a group action g on a particular object, on top of which a local deformation is applied. In this setting the notion of average shape is defined as the one that minimizes the deformations. Bregler et al. [24] have proposed a variety of methods to model non-rigid motions. Such models are built as linear combinations of a collection of "key" poses, learned from motion capture data using principal component analysis.

Local representations of motion based on optical flow have been exploited in [3, 16], and view-based methods are proposed in [2, 10]. Other approaches are based on principal component analysis [27]. In [6] a mixed-state statistical model for the representation of motion has been proposed. In this Switching Linear Dynamic Model, a stochastic finite-state automaton at the highest level switches between local linear Gaussian models. Estimation and recognition are performed with expectation-maximization approaches, using particle filters [17, 4] or structured variational inference techniques [19].

3 Modeling Human Body Motion

In this paper we focus on the problem of estimating the pose of a human body in a video sequence. The ultimate goal is to build a system that, if properly initialized, can reliably track the configuration of an articulated object such as a human body from a sequence of monocular images. We do not consider the issue of model initialization; instead, we assume that the configuration of the object in the first frame is given, for example, by manual initialization.

We model the human skeleton as a kinematic chain of rigid bodies. Each body segment is represented by a rigid link with ellipsoidal or conic support, and the links are connected together by joints. To restrict the set of admissible motions and reduce the ambiguities in the estimation, we assume that each joint allows a single degree of freedom, a rotation around its axis. This constraint is justified by the fact that the motion of the limbs in a walking gait can typically be approximated as planar, around an axis perpendicular to the direction of walking.
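
As an illustration only, the following Python sketch shows one way such a chain could be represented in code; the class and field names are ours (hypothetical), and the five-link decomposition is one plausible reading of the model used later in the experiments.

```python
import numpy as np
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Link:
    """One rigid link of the kinematic chain (hypothetical structure, not the author's code)."""
    name: str
    shape: str                    # "ellipsoid" or "conic", the two supports used in the model
    parent: Optional[int]         # index of the parent link, None for the root
    g_parent_to_link: np.ndarray  # fixed 4x4 transform g_ij between parent and link frames
    theta: float = 0.0            # single DOF: rotation angle about the local z (joint) axis


def build_chain() -> List[Link]:
    """A 5-link chain (torso, one leg, one arm), one plausible version of the model in Figure 1."""
    I4 = np.eye(4)
    return [
        Link("torso",     "ellipsoid", None, I4.copy()),
        Link("upper_leg", "conic",     0,    I4.copy()),
        Link("lower_leg", "conic",     1,    I4.copy()),
        Link("upper_arm", "conic",     0,    I4.copy()),
        Link("lower_arm", "conic",     3,    I4.copy()),
    ]
```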

Obviously, the assumption of a rigid object is not met by body parts. Various factors contribute to changing the appearance of the limbs in the sequence, such as illumination, viewing angle, occlusions, and so forth. Because of these large variations, any standard template matching technique is doomed to fail if directly applied to this problem. Our solution is to model the transformations undergone by body limbs as compositions of rigid motions and shape variations. We formulate a cost functional written in terms of the position of the links of the kinematic chain and their shape, represented by a weight function.

3.1 Exponential maps for motion representation

Among the possible representations of rigid motion, a sensible choice for our application is the one based on exponential maps. In this parameterization, an arbitrary 3D motion can be encoded in a 6-dimensional vector ξ, called a twist:

$$\xi = \begin{bmatrix} v_x & v_y & v_z & \omega_x & \omega_y & \omega_z \end{bmatrix}^T$$

The twist describes a rotation around an arbitrary axis in space: the axis and amount of rotation are given by the vector ω = (ωx, ωy, ωz), while the location of the rotation axis and the amount of translation along this axis are given by the remaining three components v = (vx, vy, vz).

The matrix form G ∈ SE(3) of the rigid motion represented by the twist ξ is given by:

$$G = e^{\hat{\xi}}, \qquad \hat{\xi} = \begin{bmatrix} 0 & -\omega_z & \omega_y & v_x \\ \omega_z & 0 & -\omega_x & v_y \\ -\omega_y & \omega_x & 0 & v_z \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
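
As a minimal sketch of this mapping (not code from the paper), the snippet below builds the 4x4 matrix ξ̂ from a twist and exponentiates it with scipy.linalg.expm; the function names are ours.

```python
import numpy as np
from scipy.linalg import expm


def hat(xi):
    """Build the 4x4 matrix xi_hat from a twist xi = [vx, vy, vz, wx, wy, wz]."""
    vx, vy, vz, wx, wy, wz = xi
    return np.array([[0.0, -wz,  wy, vx],
                     [ wz, 0.0, -wx, vy],
                     [-wy,  wx, 0.0, vz],
                     [0.0, 0.0, 0.0, 0.0]])


def twist_to_SE3(xi):
    """Rigid motion G = exp(xi_hat) as a 4x4 homogeneous matrix."""
    return expm(hat(xi))


# Example: a rotation of pi/2 about the z axis, with no translation.
G = twist_to_SE3([0.0, 0.0, 0.0, 0.0, 0.0, np.pi / 2])
```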

Exponential maps have several advantages over other parameterizations of 3D rotations. They do not suffer from the problem of singular configurations, as Euler angles do, and, as opposed to quaternion or matrix representations, it is not necessary to constrain a number of parameters larger than the degrees of freedom to a set of admissible values.

Derivatives of exponential maps with respect to their parameters can be computed in closed form but do not have simple expressions. We refer the reader to [11] for the details.

3.2 From kinematic chains to images

Given the coordinates of a point on a link of the kinematic chain, we want to compute its projection onto the image plane. If the variation in depth of the points on the articulated object is small compared to the distance from the camera, we can approximate the transformation with a scaled orthographic projection. This condition is generally met in video sequences of walking people.

In the following we consider the reference frames associated with the links to be centered on the joints, with the z axis oriented along the direction of the joint axis. Let po = [xo yo zo 1]T be the homogeneous coordinates of a point relative to the reference frame of link l. The coordinates (x, y) of its projection onto the image plane are given by:

$$\begin{bmatrix} x \\ y \end{bmatrix} = g_l(p_o, \Theta) = g_l(p_o, \xi_1, \ldots, \xi_L, s) \qquad (1)$$

$$= s\,\Pi\, e^{\hat{\xi}_1} g_{12}\, e^{\hat{\xi}_2} g_{23}\, e^{\hat{\xi}_3} \cdots e^{\hat{\xi}_l} p_o \qquad (2)$$

where $\Pi = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}$, ξ1 = [vx vy vz ωx ωy ωz]T gives the position and orientation of the first link, gij is the transformation from the reference frame of link i to the reference frame of link j, and ξi = [0 0 0 0 0 ωz]T gives the rotation of link i around the axis of the associated joint.
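
Under the assumptions of equation (2), the following sketch (our own naming, with a zero-based link index) chains the exponentials and fixed transforms and applies the scaled orthographic projection.

```python
import numpy as np
from scipy.linalg import expm

# Pi keeps the first two rows of the homogeneous coordinates (scaled orthography).
PI = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0]])


def hat(xi):
    """4x4 twist matrix xi_hat for xi = [vx, vy, vz, wx, wy, wz]."""
    vx, vy, vz, wx, wy, wz = xi
    return np.array([[0.0, -wz,  wy, vx],
                     [ wz, 0.0, -wx, vy],
                     [-wy,  wx, 0.0, vz],
                     [0.0, 0.0, 0.0, 0.0]])


def project_point(p_o, l, twists, g_fixed, s):
    """Image coordinates of the homogeneous point p_o given in the frame of link l (eq. 1-2).

    l       -- zero-based index of the link the point belongs to
    twists  -- twists[0] is the full pose of the first link; twists[i], i > 0, has only the
               last component (omega_z) non-zero, the rotation of link i about its joint
    g_fixed -- g_fixed[i] is the fixed 4x4 transform between consecutive link frames
    s       -- scale factor of the scaled orthographic projection
    """
    G = expm(hat(twists[0]))
    for i in range(1, l + 1):
        G = G @ g_fixed[i - 1] @ expm(hat(twists[i]))
    return s * (PI @ (G @ p_o))
```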

3.3 Matching deforming regions

Consider the problem of tracking the position of a rigid object in a sequence of images. A standard approach is to minimize the sum of squared differences between the intensity of the pixels in the model M and the corresponding pixels in the images Ii. Here the model can be the first frame, where the position of the object is given. Let Ω be the set of points on the model that belong to the object and g(·, Θi) the transformation that maps points on the model to corresponding points on the image Ii. Then the function to minimize with respect to the motion parameters Θi is:

$$E = \sum_{i=1}^{n_f} \int_{\Omega} \left( I_i(g(p, \Theta_i)) - M(p) \right)^2 dp \qquad (3)$$

As previously mentioned, this simple solution cannot cope with the significant changes in appearance that occur in the case of moving body parts.

Our approach is to allow for deviations from the original template by introducing a weight map W(p) that defines the shape of the object. W(p) ranges between 0 and 1, being 1 for points inside the object and 0 for points outside. Then we can estimate the position of the object and its shape by minimizing a cost functional written in terms of the motions Θi and the weight map W:

$$E = \sum_{i=1}^{n_f} \frac{\int_{\Omega} \left( I_i(g(p, \Theta_i)) - M(p) \right)^2 W(p)\, dp}{\int_{\Omega} W(p)\, dp} \qquad (4)$$

Notice that we have introduced a normalizing term in the denominator of the cost function. This term prevents W ≡ 0 from being a solution to the minimization problem.
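
A discrete-pixel sketch of (4) is given below; it assumes a warp function implementing g(·, Θi), uses nearest-neighbour sampling for brevity, and all names are ours, not the author's.

```python
import numpy as np


def weighted_ssd(frames, model, omega, warp, thetas, weights, eps=1e-8):
    """Discrete version of the cost functional (4).

    frames  -- list of grayscale images I_i (2D arrays)
    model   -- template image M (2D array)
    omega   -- (N, 2) integer (x, y) model coordinates of the points p in Omega
    warp    -- warp(points, theta) -> (N, 2) image coordinates g(p, theta)
    thetas  -- list of motion parameters, one per frame
    weights -- weight map W sampled at the points of omega, shape (N,)
    """
    m_vals = model[omega[:, 1], omega[:, 0]]
    E = 0.0
    for I, theta in zip(frames, thetas):
        q = np.rint(warp(omega, theta)).astype(int)          # nearest-neighbour sampling
        q[:, 0] = np.clip(q[:, 0], 0, I.shape[1] - 1)
        q[:, 1] = np.clip(q[:, 1], 0, I.shape[0] - 1)
        residual = I[q[:, 1], q[:, 0]] - m_vals
        E += np.sum(residual ** 2 * weights) / (np.sum(weights) + eps)
    return E
```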

3.4 Modeling Self-Occlusions

Visual tracking of complex articulated objects such as the human body demands explicit modeling of self-occlusions. This can be done by introducing a visibility function V in the cost functional. Considering the case of an articulated object with L links, we have:

$$E = \sum_{i=1}^{n_f} E_i(\Theta_i, W) = \sum_{i=1}^{n_f} \frac{\sum_{l=1}^{L} \int_{\Omega_l} \left( I_i(g_l(p, \Theta_i)) - M_l(p) \right)^2 W_l(p)\, V_l(p, \Theta_i)\, dp}{\sum_{l=1}^{L} \int_{\Omega_l} W_l(p)\, V_l(p, \Theta_i)\, dp} \qquad (5)$$

where W = (W1, W2, ..., WL) and:

$$V_l(p, \Theta) = \begin{cases} 1 & \text{if the point } p \text{ of link } l \text{ is visible in pose } \Theta \\ 0 & \text{otherwise} \end{cases}$$

In order to compute derivatives of the energy with respect to the motion parameters Θ, we need to find an analytical expression for the visibility function Vl(p, Θ). For this purpose we can use the signed distance function of a point p from a closed curve. It is defined as the minimum distance of p from a point on the curve, with a plus sign if p is outside the curve and a minus sign if p is inside. In the case at hand the shapes are simple and the distance can be computed analytically from their parameters.

Let dl(p) be the signed distance function of the projection of p from the contour of link l on the image. Then we can write Vl as:

$$V_l(p, \Theta) = \prod_{j \in F(l)} H(d_j(p, \Theta))$$

where F(l) = {j : link j is in front of link l} and H(·) is the Heaviside function:

$$H(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \geq 0 \end{cases}$$

Assuming that the order of visibility defined by F(·) does not change during the motion, we can compute the derivatives of V as:

$$\frac{\partial V_l(p, \Theta)}{\partial \Theta} = \sum_{k \in F(l)} \delta(d_k(p, \Theta))\, \frac{\partial d_k}{\partial \Theta}(p, \Theta) \prod_{j \in F(l),\, j \neq k} H(d_j(p, \Theta)) \qquad (6)$$
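
A minimal sketch of the visibility test is shown below; it assumes per-link signed distance functions are available, and the circular contour at the end is purely a hypothetical placeholder for the ellipse and trapezoid contours used in the actual model.

```python
import numpy as np


def heaviside(x):
    """H(x): 0 for x < 0, 1 for x >= 0."""
    return np.where(np.asarray(x, dtype=float) >= 0.0, 1.0, 0.0)


def visibility(p_img, l, front_of, signed_dist, theta):
    """V_l(p, Theta) as a product of Heaviside terms over the links occluding link l.

    front_of[l]          -- indices of the links that lie in front of link l (the set F(l))
    signed_dist(j, q, t) -- signed distance of image point q from the contour of link j in
                            pose t (positive outside the contour, negative inside)
    """
    V = 1.0
    for j in front_of[l]:
        V *= float(heaviside(signed_dist(j, p_img, theta)))
    return V


def circle_signed_dist(q, center, radius):
    """Hypothetical contour: signed distance of q from a circle of given center and radius."""
    return np.linalg.norm(np.asarray(q, dtype=float) - np.asarray(center, dtype=float)) - radius
```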

4 Tracking Algorithm

The first step of the algorithm is to build a model of the appearance of the human body in the sequence. We use a kinematic chain manually initialized to match the pose of the subject in the first frame. We extract Ml, the appearance model of link l, and its domain Ωl by picking the region in the first frame corresponding to the projection of link l on the image.

Then we perform tracking by minimizing the energy functional in (5) with respect to Θi and W. We use an alternating minimization scheme: given an initial guess for W, minimize with respect to the motions Θi, then fix Θi at these optimal values and minimize with respect to W. The minimizations are performed using a gradient descent scheme.
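
The alternation can be sketched as follows; the helper callables for the energy gradients, the step sizes, and the clipping of W to [0, 1] are our assumptions, not prescriptions from the paper.

```python
import numpy as np


def minimize_block(thetas, W, grad_theta, grad_W,
                   n_outer=10, n_inner=50, lr_theta=1e-3, lr_W=1e-2):
    """Alternating gradient descent on the energy (5) for one block of frames (sketch).

    thetas     -- list of motion parameter vectors, one per frame in the block
    W          -- weight maps W_l, e.g. an array of shape (L, height, width)
    grad_theta -- grad_theta(i, theta_i, W) -> gradient of E_i with respect to theta_i
    grad_W     -- grad_W(thetas, W) -> gradient of E with respect to the weight maps
    """
    for _ in range(n_outer):
        # 1) fix W and descend on each frame's motion parameters
        for i in range(len(thetas)):
            for _ in range(n_inner):
                thetas[i] = thetas[i] - lr_theta * grad_theta(i, thetas[i], W)
        # 2) fix the motions and descend on the shape weights, kept in [0, 1]
        for _ in range(n_inner):
            W = np.clip(W - lr_W * grad_W(thetas, W), 0.0, 1.0)
    return thetas, W
```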

The gradient of (5) with respect to Θi is:

$$\begin{aligned}
\frac{\partial E}{\partial \Theta_i} = \frac{1}{A} \Bigg( & \sum_{l=1}^{L} \int_{\Omega_l} \Delta I_i(p, \Theta_i, l) \left( \nabla I_i^T(g_l(p, \Theta_i))\, \frac{\partial g_l}{\partial \Theta_i}(\Theta_i)\, p \right) V_l(p, \Theta_i)\, W_l(p)\, dp \\
& + \sum_{l=1}^{L} \int_{\Omega_l} \Delta I_i(p, \Theta_i, l)^2\, \frac{\partial V_l}{\partial \Theta_i}(p, \Theta_i)\, W_l(p)\, dp \\
& - E_i(\Theta_i, W) \sum_{l=1}^{L} \int_{\Omega_l} \frac{\partial V_l}{\partial \Theta_i}(p, \Theta_i)\, W_l(p)\, dp \Bigg) \qquad (7)
\end{aligned}$$

where

$$A = \sum_{l=1}^{L} \int_{\Omega_l} W_l(p)\, V_l(p, \Theta_i)\, dp \qquad \text{and} \qquad \Delta I_i(p, \Theta, l) = I_i(g_l(p, \Theta)) - M_l(p).$$

The derivative with respect to the support map W(·) is:

$$\nabla_{W_l} E(p_k) = \sum_{i} \frac{\delta(d_l(p_k))\, V_l(p_k, \Theta_i)}{A} \left( \Delta I_i(p_k, \Theta_i, l)^2 - E_i(\Theta_i, W) \right) \qquad (8)$$
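
Since these gradients are cumbersome (and the experiments in Section 5 use numerical derivatives for the motion parameters), a finite-difference approximation is a natural fallback; the sketch below assumes the energy is available as a callable, and the names are ours.

```python
import numpy as np


def numerical_grad(energy, theta, h=1e-4):
    """Central finite-difference approximation of the gradient of a scalar energy.

    energy -- callable mapping a motion parameter vector theta to the scalar E_i(theta, W)
    theta  -- current motion parameters (1D array)
    """
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = h                                          # perturb one parameter at a time
        g[k] = (energy(theta + e) - energy(theta - e)) / (2.0 * h)
    return g
```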

This procedure is applied to blocks of nf frames of the sequence to be tracked. nf is a parameter of the algorithm and determines how many frames are used for estimating the shapes Wl. Setting it to 1 is not a good idea, because it means updating the shape of the object from the matching on a single image. Using large values, such as the number of frames in the sequence, is also not advisable, because the appearance of body parts in distant frames can vary considerably, and this would negatively affect the shape estimation.

It must be pointed out that in this formulation we do not exploit the temporal continuity of the motions Θi. That is, we would obtain the same results if we applied the algorithm to a sequence where the order of the frames is scrambled. In practice, since the derivative of (5) with respect to Θi is independent of Θj, we exploit continuity by performing the minimization separately for each Θi and by using the optimal value of Θi to initialize the minimization of Θi+1. Also, we do not constrain the parameters of the kinematic chain to represent configurations that are physically feasible for the human body. Adding to the cost function terms that model spatial constraints or joint dynamics could improve the results.
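
A driver loop for the block-wise processing with warm-starting might look like the following sketch; the per-block routine and its simplified signature are assumptions carried over from the earlier sketches, not the author's implementation.

```python
def track_sequence(frames, theta0, W0, minimize_block, nf=5):
    """Process the sequence in blocks of nf frames, warm-starting the poses of each block
    from the previous estimate (hypothetical driver around the per-block minimization)."""
    all_thetas, W = [], W0
    theta_prev = theta0
    for start in range(0, len(frames), nf):
        block = frames[start:start + nf]
        thetas = [theta_prev.copy() for _ in block]    # initialize from the last estimate
        thetas, W = minimize_block(block, thetas, W)   # alternate over motions and shapes
        all_thetas.extend(thetas)
        theta_prev = thetas[-1]
    return all_thetas, W
```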

Figure 1: Example of the kinematic chain model of the human body used for tracking.

5 Experimental results

In our preliminary experiments we tracked the position of the body of two subjects performing a walking gait. The sequences are 100 and 43 frames long. In both sequences we tracked the position of the torso, one leg, and one arm using a kinematic chain model with 5 links, pictured in Figure 1. It can be seen that the chain model has links of two different shapes: ellipsoids and conic sections. For computational efficiency, the projection on the image of a conic section is approximated with a trapezoid.

The model has been manually initialized in the first frame of the sequence by specifying the geometry of the links and by clicking on their position in the image. Once this initialization step is complete, the system performs the tracking task by minimizing (5), alternating between the motion parameters Θ and the shapes Wl. In the minimization over the motion parameters, numerical derivatives have been used. The results were obtained by performing the minimization on blocks of nf = 5 images.

The results are shown in Figure 5. In the first and fourth rows we show some keyframes of the original sequences; in the second and fifth, frames with the estimated kinematic chain superimposed; and in the third and sixth, the estimated shape maps Wl superimposed. We have also included movie clips of the original and tracked sequences.

References

[1] J. K. Aggarwal and Q. Cai. Human motion analysis: a review. 1999.

[2] M. J. Black. Eigentracking: robust matching and tracking of articulated objects. 1996.

Figure: Results of the tracking. The first and fourth rows show frames from the original sequences. The second and fifth rows display the estimated pose of the kinematic chain. The third and sixth show the estimated weight maps.

[3] M. J. Black. Explaining optical flow events with parameterized spatio-temporal models. In Proc. of the Conference on Computer Vision and Pattern Recognition, volume 1, pages 326–332, 1999.

[4] M. J. Black and A. D. Jepson. A probabilistic framework for matching temporal trajectories: condensation-based recognition of gestures and expressions. In Proc. of the European Conference on Computer Vision, volume 1, pages 909–924, 1998.

[5] C. Bregler. Tracking people with twists and exponential maps. In Proc. of the International Conference on Computer Vision and Pattern Recognition, 1998.

[6] C. Bregler. Learning and recognizing human dynamics in video sequences. In Proc. of the Conference on Computer Vision and Pattern Recognition, pages 568–574, 1997.

[7] J. Deutscher, A. Blake and I. Reid. Articulated motion capture by annealed particle filtering. In Proc. of CVPR, pages 126–133, 2000.

[8] D. M. Gavrila and L. S. Davis. Tracking of humans in action: a 3-D model-based approach. 1996.

[9] D. M. Gavrila. The visual analysis of human movement: a survey. Computer Vision and Image Understanding, 73:82–98, 1999.

[10] M. A. Giese and T. Poggio. Morphable models for the analysis and synthesis of complex motion patterns. International Journal of Computer Vision, 38(1):1264–1274, 2000.

[11] F. S. Grassia. Practical parameterization of rotations using the exponential map. Journal of Graphics Tools, 3(3):29–48, 1998.

[12] N. R. Howe, M. E. Leventon and W. T. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In Proc. of NIPS, volume 12, pages 820–826, 2000.

[13] M. Isard and A. Blake. Condensation: conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.

[14] I. A. Kakadiaris and D. Metaxas. Model-based estimation of 3D human motion with occlusion based on active multi-viewpoint selection. In Proc. of CVPR, pages 618–623, 1995.

[15] M. K. Leung and Y. H. Yang. First sight: a human body outline labeling system. IEEE Trans. on PAMI, 17(4):359–377, 1995.

[16] J. J. Little and J. E. Boyd. Recognizing people by their gait: the shape of motion. 1996.

[17] B. North, A. Blake, M. Isard and J. Rittscher. Learning and classification of complex dynamics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):1016–1034, 2000.

[18] N. Paragios and R. Deriche. Geodesic active contours and level sets for the detection and tracking of moving objects. IEEE Trans. on PAMI, 22(3):266–280, 2000.

[19] V. Pavlovic, J. Rehg and J. MacCormick. Impact of dynamic model learning on classification of human motion. In Proc. of the International Conference on Computer Vision and Pattern Recognition, 2000.

[20] J. M. Rehg and T. Kanade. Model-based tracking of self-occluding articulated objects. In ICCV, 1995.

[21] A. Shio and J. Sklansky. Segmentation of people in motion. In Proc. of the IEEE Workshop on Visual Motion, pages 325–332, 1991.

[22] H. Sidenbladh, M. Black and D. Fleet. Stochastic tracking of 3D human figures using 2D image motion. In Proc. of ECCV, volume II, pages 307–323, 2000.

[23] S. Soatto and A. Yezzi. Deformotion: deforming motions and shape averages. In Proc. of ECCV, LNCS, Springer Verlag, May 2002.

[24] L. Torresani, D. Yang, G. Alexander and C. Bregler. Tracking and modelling non-rigid objects with rank constraints. In Proc. of the International Conference on Computer Vision and Pattern Recognition, 2001.

[25] S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. CVIU, 74(3):174–192, 1999.

[26] C. Wren. Dynamic models of human motion. 1998.

[27] Y. Yacoob and M. J. Black. Parameterized modeling and recognition of activities. Computer Vision and Image Understanding, 73(2):232–247, 1999.

