Online Gesture Spotting from Visual Hull Data
Bo Peng and Gang Qian, Member, IEEE

Abstract—This paper presents a robust framework for online full-body gesture spotting from visual hull data. Using view-invariant pose features as observations, hidden Markov models (HMMs) are trained for gesture spotting from continuous movement data streams. Two major contributions of this paper are 1) view-invariant pose feature extraction from visual hulls, and 2) a systematic approach to automatically detecting and modeling specific nongesture movement patterns and using their HMMs for outlier rejection in gesture spotting. The experimental results have shown the view-invariance property of the proposed pose features for both training poses and new poses unseen in training, as well as the efficacy of using specific nongesture models for outlier rejection. Using the IXMAS gesture data set, the proposed framework has been extensively tested and the gesture spotting results are superior to those reported on the same data set obtained using existing state-of-the-art gesture spotting methods.

Index Terms—Online gesture spotting, view invariance, multilinear analysis, visual hull, hidden Markov models, nongesture models.


1 INTRODUCTION

Human gesture recognition has received considerable attention in the past decade. Enabling machines to understand human gestures is critical for developing embodied human-computer interaction (HCI) systems that allow users to communicate with computers through actions and gestures in a much more intuitive and natural manner than traditional interfaces based on mouse clicks and keystrokes. Such systems have important applications in virtual reality [1], industrial control [2], healthcare [3], [4], computer games [5], human-robot interaction [6], and interactive dance performance [7], [8].

A gesture recognition system is preferred to be nonintrusive. Using body-worn sensors such as markers and inertial sensors in gesture recognition is cumbersome and sometimes movement-restraining. For this reason, video sensing has been widely applied in gesture recognition. In practice, gestures need to be simultaneously detected and recognized from continuous movement data. This task is commonly referred to as gesture spotting [9], [10]. Furthermore, online gesture spotting is often desired for real-time processing, in which the recognition decision is made using data up to the current observation without having to wait for any future data.

In this paper, we present an online, video-based framework for view-invariant, full-body gesture spotting. After extracting view-invariant pose features using multilinear analysis from visual hull data, hidden Markov models (HMMs) are trained for gesture recognition by using these pose features as observations. The proposed method has been extensively tested on the IXMAS gesture data set [11] and our results are superior to those reported on the same data set in [11], [12], [13].

The outline of this paper is as follows: The rest of this section reviews state-of-the-art video-based gesture recognition and discusses relevant outstanding challenges. In Section 2, we briefly introduce multilinear analysis and the reduced-parameter HMM. In Section 3, key pose selection is discussed. In Section 4, we present the proposed pose feature extraction method as well as results on view-invariance evaluation. Section 5 introduces the model learning and gesture spotting strategies and our proposed method for autonomous nongesture movement pattern detection and modeling. Experimental results and performance analysis are provided in Section 6. Finally, in Section 7, we conclude the paper and present future research plans.

1.1 Video-Based Gesture Recognition

Many video-based methods have been developed for hand [14], [15], [16], arm [17], [18], and full-body [6], [11] gesture recognition. See [19] for a recent literature survey. These methods can be roughly classified as the kinematic-based [5], [6], [15], [20], [21], [22], [23] and the template-based approaches [11], [12], [13], [18], [24], [25], [26]. The kinematic-based approaches use articulated motion parameters such as joint angle vectors [6], body-centered joint locations [21], or body part positions [5], [15], [20] as features for gesture recognition. The major weakness of such approaches is that reliable articulated motion tracking from video is challenging and kinematic recovery is subject to tracking failures.

The template-based approaches represent gestures using features extracted directly from image observations, and these approaches can be further split into the holistic and the sequential approaches. The holistic approaches, e.g., [11], [24], [25], represent an entire gesture as a spatio-temporal shape from which features are extracted for gesture recognition. In contrast, the sequential approaches, e.g., [12], [13], [26], represent a gesture as a temporal series of features, one for each time instant. Compared to the holistic approaches, the sequential approaches are more powerful in capturing and modeling variations in gesture dynamics. When spotting gestures from continuous data, the sequential approaches simultaneously detect the gesture boundaries and evaluate gesture likelihoods for every incoming data frame. This is very challenging for the holistic approaches without extra delay. For these reasons, in this paper, we focus on template-based, sequential gesture spotting and address the two pressing challenges: reliable view-invariant feature extraction and accurate online spotting.

1.2 Challenges in Gesture Spotting

1.2.1 View-Invariant Recognition

Many HCI systems require view invariance so that gestures can be spotted independent of the body orientation of the subject. In our research, we focus on the body orientation angle about the vertical body axis perpendicular to the ground plane and through the body centroid. Many monocular gesture spotting approaches are view dependent [9], [10], [27], [28], [29], [30], [31], i.e., with known body orientation angles. Although valid in some scenarios such as automatic sign language interpretation, having to know the body orientation presents an undesirable constraint, hampering the flexibility and sometimes the usability of an HCI system.

To achieve view invariance, some template-based approaches directly compare the input images against prestored image templates corresponding to different views. For example, a view-invariant gesture recognition method is presented in [13] based on key pose matching and Viterbi path searching. Although promising results have been reported in [13], such an exhaustive comparison strategy requires a compromise between view-angle resolution, the number of key poses, and computational efficiency. To address this issue, various forms of view-invariant pose features derived from visual hull data have been introduced [11], [17], [32]. Once extracted from the input data, these pose features can be directly matched to the training feature templates for gesture recognition. During the extraction of such features, the visual hull data are transformed to the 3D shape context [17], [32] or the 3D motion history volume [11], and the data points are indexed in a body-centered cylindrical coordinate system. The angular dimension in the cylindrical coordinate system is then suppressed to obtain pose features independent of the viewpoint. Pose features extracted using these methods are view invariant since the orientation of the subject no longer affects the extracted features. However, suppression of the angular dimension may cause information loss and introduce ambiguity in gesture recognition.

Pose features extracted from a pair of input silhouettes using multilinear analysis have been introduced for pose and gesture recognition [26], [33]. Such pose features are view invariant for the key poses used in tensor training. However, the view-invariance property does not generalize to the features of other nonkey poses. To summarize, reliable extraction of view-invariant pose features for gesture spotting remains a challenge.

1.2.2 Online Gesture Spotting

Many gesture-driven HCI applications require real-time gesture spotting with minimum delay. Hence, it is desirable that the gesture spotting be done online using only the current and past movement data. Accurate gesture spotting is also critical for successful HCI applications. An interactive HCI system provides feedback to the user in response to a gesture command. It is nearly impossible for the system to reverse any issued feedback without disturbing the user's experience. Therefore, reliable online gesture spotting is a pressing challenge for gesture-driven HCI.

A number of pattern analysis frameworks have been adopted for gesture spotting, including dynamic time warping (DTW) [34], the HMM [35], and conditional models such as the maximum entropy Markov model (MEMM) [36] and the conditional random field (CRF) [37]. DTW was designed to evaluate the similarity of two data segments and it has been used in speech and music recognition [38], [39] as well as gesture recognition and spotting [9], [28], [40]. Compared to other models, DTW lacks an effective way to model system dynamics and gesture variations.

MEMM [36] and CRF [37] have recently been applied to pattern spotting with encouraging results [10]. MEMM and CRF are discriminative state-based models describing the conditional probability of the state sequence given the observation. MEMM and CRF have been claimed to be superior to generative models such as the HMM because they allow long-distance dependencies [37], [41]. Features from past and future observations are used to explicitly represent long-distance interactions and dependencies, leading to more natural models than those from the HMM. However, MEMM and CRF have certain limitations. Apart from the label bias problem of MEMM [37], CRF training is much more computationally expensive and converges much more slowly than HMM and MEMM training [37], [41]. The scalability of CRF is another problem. CRF builds a unified model including all of the patterns to be recognized. As a result, adding new patterns requires retraining the entire model, and the previous model has to be discarded.

HMM [35] is a commonly used state-based, generative framework for sequential pattern analysis. Using an HMM, an observation sequence is modeled as being emitted from the corresponding hidden states. Gesture spotting has been done using the HMM network [27], [30], [31], [42], [43], [44], which is formed by connecting gesture and nongesture HMMs in parallel. Usually, one or two nongesture HMMs are used to provide likelihood thresholds for outlier rejection. For instance, in [30], [31], [42], [44], a weak universal movement model has been used for adaptive thresholding. Representing nongestures using one or two HMMs cannot effectively reject complex outliers, e.g., when they resemble portions of a gesture. Effectively detecting and modeling nongesture movement patterns remains a challenge for HMM-based gesture spotting systems.

In summary, to develop online gesture spotting systems for gesture-driven HCI, two pressing challenges need to be addressed: 1) robust view-invariant pose feature extraction, and 2) effective detection and modeling of nongesture movement patterns for online gesture spotting. In this paper, we systematically tackle these challenges and make the following major contributions to online gesture spotting:

• A robust approach to view-invariant pose feature extraction from visual hull data using multilinear analysis. Experimental results show that the proposed pose features are view invariant for both training poses and new poses unseen in training.


• A systematic approach to detecting and modeling specific nongesture movement patterns and using their HMMs for outlier rejection. As shown in the experimental results, using specific nongesture models noticeably improves gesture spotting by reducing false alarm rates and increasing recognition reliability, without significantly sacrificing the recognition rates.

Using the IXMAS data set [11], our proposed gesture spotting framework has been extensively tested and our results are superior to those reported on the same data set in [11], [12], [13]. Fig. 1 shows the block diagram of our proposed system. This paper extends [45] by including the nongesture detection and modeling algorithm, and the results on view-invariance evaluation and gesture spotting.

2 THEORETICAL BACKGROUND

2.1 Multilinear Analysis

In multilinear analysis, multimode data ensembles are represented by tensors, a higher order generalization of vectors and matrices (two-mode tensors). A data ensemble affected by $m$ factors is represented as an $(m+1)$-mode tensor $\mathcal{T} \in \mathbb{R}^{N_v \times N_1 \times N_2 \times \cdots \times N_m}$, where $N_v$ is the length of a data vector and $N_i$, $i = 1, \ldots, m$, is the number of possible values of the $i$th factor.

As a generalization of SVD on matrices, the high-order singular value decomposition (HOSVD) [46] can decompose a tensor $\mathcal{A} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_n}$ as follows:

$$\mathcal{A} = \mathcal{S} \times_1 U_1 \times_2 U_2 \cdots \times_n U_n, \qquad (1)$$

where $U_j \in \mathbb{R}^{N_j \times N'_j}$ ($N'_j \le N_j$) are the mode matrices containing orthonormal column vectors and $\mathcal{S} \in \mathbb{R}^{N'_1 \times N'_2 \times \cdots \times N'_n}$ is the core tensor. The mode matrices and core tensor are, respectively, analogous to the left and right matrices and the diagonal matrix in SVD. Details on tensor algebra such as tensor multiplication and HOSVD can be found in Appendix A in the supplemental material for this paper, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.199. Let $u_{j,k}$ be the $k$th row of $U_j$. It follows that [47]

$$\mathcal{A}_{i_1, i_2, \ldots, i_n} = \mathcal{S} \times_1 u_{1,i_1} \times_2 u_{2,i_2} \cdots \times_n u_{n,i_n}. \qquad (2)$$

Let $\mathcal{A}(i_1, \ldots, i_{j-1}, :, i_{j+1}, \ldots, i_n)$ be the column vector containing $\mathcal{A}_{i_1, \ldots, i_j, \ldots, i_n}$, $i_j = 1, \ldots, N_j$. Then,

$$\mathcal{A}(i_1, \ldots, i_{j-1}, :, i_{j+1}, \ldots, i_n) = \mathcal{S} \times_j U_j \times_1 u_{1,i_1} \times_2 u_{2,i_2} \cdots \times_{j-1} u_{j-1,i_{j-1}} \times_{j+1} u_{j+1,i_{j+1}} \cdots \times_n u_{n,i_n}. \qquad (3)$$

The coefficients $u_{k,i_k}$ in each mode can be considered as independent factors contributing to the data point, and the interaction of these factors is governed by the tensor $\mathcal{S} \times_j U_j$. Due to this factorizing property, multilinear analysis has been widely used to decompose data ensembles into perceptually independent sources of contributing factors for face recognition [48], 3D face modeling [49], synthesis of texture and reflectance [50], movement analysis and recognition from motion capture data [51], [52], and image synthesis for articulated movement tracking [53].
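As a concrete illustration of the decomposition in (1), the following is a minimal NumPy sketch of a truncation-free HOSVD (mode-n unfolding, SVD per mode, projection onto the core). It is an illustrative reimplementation under these conventions, not the authors' code; the tensor size is a placeholder.

```python
# Minimal HOSVD sketch (illustrative, not the authors' code).
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the remaining modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a target tensor shape."""
    full_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(full_shape), 0, mode)

def mode_multiply(tensor, matrix, mode):
    """n-mode product: multiply `matrix` into `tensor` along `mode`."""
    new_shape = list(tensor.shape)
    new_shape[mode] = matrix.shape[0]
    return fold(matrix @ unfold(tensor, mode), mode, new_shape)

def hosvd(tensor):
    """Return (core, mode_matrices) with tensor = core x_1 U_1 ... x_n U_n (Eq. (1), no truncation)."""
    mode_matrices, core = [], tensor
    for mode in range(tensor.ndim):
        U, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)  # left singular vectors
        mode_matrices.append(U)
        core = mode_multiply(core, U.T, mode)  # project the core: S = A x_n U_n^T
    return core, mode_matrices

# Example with a small stand-in for the (voxel x orientation x pose) tensor.
A = np.random.rand(50, 16, 25)
S, (Uv, Uo, Up) = hosvd(A)
recon = mode_multiply(mode_multiply(mode_multiply(S, Uv, 0), Uo, 1), Up, 2)
print(np.allclose(A, recon))  # True: exact reconstruction without truncation
```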

2.2 Reduced-Parameter Hidden Markov Models

Currently, HMM [35] is the primary tool for sequential modeling and inference. When using traditional HMMs for gesture recognition, usually a large number of parameters need to be trained. For instance, an $n$-state HMM has $O(n^2)$ state transition parameters to be learned from training data. Learning a large set of model parameters presents an outstanding challenge for applications where only limited training data are available.

In [54], the reduced-parameter HMM has been proposed to address this challenge. The reduced HMM reduces the number of state transition parameters to $O(n)$. It improves the computational efficiency of the inference, allowing the number of states to increase while preserving real-time recognition. The reduced HMM is also trained using the expectation-maximization (EM) algorithm. Compared to the traditional HMMs, due to the reduced parameter size, the computational complexity for reduced HMM training is much lower and fewer training samples are required [54]. Because of these advantages, in our proposed gesture spotting framework we have adopted the reduced HMM to represent gesture and nongesture movement patterns. On the other hand, due to the unique parameterization of the state transition probabilities, the reduced HMM can exactly represent only a subset of the standard left-to-right HMMs. In our research, this limitation does not create noticeable issues in spotting gestures from the IXMAS data set, indicating that the related movement patterns can be well modeled using the reduced HMM. See Appendix B in the supplemental material for this paper, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.199, for more details on the reduced HMM and [54] for a complete treatment.

3 KEY POSE SELECTION

In our proposed approach, key poses are selected from the gesture vocabulary for gesture spotting. Good key poses are quite different from each other to avoid ambiguities in pose features. In our approach, key pose candidates are first detected based on the motion energy, and then they are clustered to locate the key poses as the cluster centers. In our experiments, the visual hull data in the IXMAS data set were directly used.

Fig. 1. The block diagram of the proposed gesture spotting framework.

Let $V_t$ be the visual hull at time $t$ and $V_t(i,j,k)$ be the value of the voxel at location $(i,j,k)$. Define the difference of visual hulls $F_t$:

$$F_t(i,j,k) = \begin{cases} 1, & V_t(i,j,k) = 0 \ \text{and} \ \sum_{\tau = t-W/2}^{t+W/2} V_\tau(i,j,k) > 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (4)$$

where $W$ is the width of a time window. Given a movement sequence of $L$ frames, the "motion energy" at time $t$, $W+1 \le t \le L-W$, is defined as the number of nonzero voxels in $F_t$, i.e.,

$$E(t) = \sum_{i,j,k} F_t(i,j,k). \qquad (5)$$
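The following sketch illustrates (4) and (5) on a stack of binary visual hulls, together with the detection of key pose candidates at the local extrema of E(t). It assumes the hulls are stored as a NumPy array of occupancy grids; the array layout and function names are illustrative assumptions, not the authors' code.

```python
# Sketch of the difference-of-visual-hulls motion energy (Eqs. (4)-(5)) and
# key pose candidate detection at local extrema of E(t). Illustrative only.
import numpy as np

def motion_energy(hulls, W):
    """hulls: (L, X, Y, Z) binary array; returns E(t) for the valid window range."""
    L = hulls.shape[0]
    half = W // 2
    energy = np.zeros(L)
    for t in range(half, L - half):
        occupied_in_window = hulls[t - half:t + half + 1].sum(axis=0) > 0
        F_t = (hulls[t] == 0) & occupied_in_window      # Eq. (4)
        energy[t] = F_t.sum()                           # Eq. (5)
    return energy

def keypose_candidates(energy):
    """Frame indices at local extrema of E(t); these candidates are clustered afterwards."""
    e = np.asarray(energy)
    inner = e[1:-1]
    is_max = (inner >= e[:-2]) & (inner >= e[2:])
    is_min = (inner <= e[:-2]) & (inner <= e[2:])
    return np.where(is_max | is_min)[0] + 1
```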

In our approach, the key pose candidates are automatically detected at the local extrema of $E(t)$ of the training gesture data. These key pose candidates are further clustered to eliminate repetitive poses.

Pose clustering requires an interpose distance measure. To suppress distances caused by changes in body shapes and gesture execution locations, normalization such as centering and rescaling is applied to the visual hull data, and the normalized visual hull is used in such distance computation. Details on the visual hull normalization can be found in Appendix C in the supplemental material for this paper, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.199. Define the distance between two normalized voxel data $V$ and $V'$:

$$d_V(V, V') = \frac{\|V - V'\|}{\|V \cap V'\|}, \qquad (6)$$

where $\|\cdot\|$ is the cardinality operator which returns the number of valid (nonzero) voxels. The intersection ($\cap$) operation is carried out by treating the binary visual hulls as logical data arrays.

Assume that $V_1$ and $V_2$ are the normalized visual hulls of two poses $p_1$ and $p_2$, respectively. To minimize the impact of view angles, we define $d_P(p_1, p_2)$, a view-independent distance between the two poses:

$$d_P(p_1, p_2) = \min_{\theta} d_V(V_1, R(V_2, \theta)), \qquad (7)$$

where $R(V_2, \theta)$ is the visual hull obtained by rotating $V_2$ counterclockwise about its vertical body axis by angle $\theta$. In practice, given $V_1$ and $V_2$, $d_P(p_1, p_2)$ is found through exhaustive search over $\theta$ on a uniform grid over $(0, 2\pi]$.

Using $d_P(\cdot, \cdot)$, the distance matrix of the candidate key poses is computed. Then, normalized cut [55] is used to cluster the candidate key poses and the resulting cluster centers are taken as the final key poses.

4 POSE FEATURE EXTRACTION USING MULTILINEAR ANALYSIS

The visual hull of a human pose is mainly affected by three factors: the body shape of the subject, the joint angle configuration (pose), and the body orientation. In our research, we concentrate on the pose and orientation factors. Concerning the body shape factor, although the visual hull normalization can reduce its influence to a certain extent, as shown by the experimental results in this section, different body shapes do introduce relatively large variations to the pose features. This implies that to obtain accurate gesture spotting, the testing subject needs to share a similar body shape with at least one of the training subjects. In our future work, this limitation will be addressed by developing pose features invariant to both view and body shape. As shown in Fig. 2, the proposed pose feature is obtained by projecting an input visual hull onto a pose feature space using the core tensor obtained via pose tensor decomposition.

4.1 Pose Tensor Decomposition Using HOSVD

Given selected key poses, a pose tensor can be formed using their normalized voxel data in different orientations, as shown in Fig. 3a. Details on the pose tensor formulation in our experiments are presented in Section 6.1. This pose tensor can be decomposed using HOSVD to extract the core tensor and mode matrices, as described in Section 2.1. In our proposed framework, we do not conduct dimension reduction in any of the modes. According to (1), the pose tensor $\mathcal{A} \in \mathbb{R}^{N_v \times N_o \times N_p}$ can be decomposed into

$$\mathcal{A} = \mathcal{S} \times_1 U_v \times_2 U_o \times_3 U_p, \qquad (8)$$

where $\mathcal{S}$ is the core tensor of the same size as $\mathcal{A}$, and $U_v \in \mathbb{R}^{N_v \times N_v}$, $U_o \in \mathbb{R}^{N_o \times N_o}$, and $U_p \in \mathbb{R}^{N_p \times N_p}$ are, respectively, the voxel, orientation, and pose mode matrices. In our approach, we only calculate $U_o$ and $U_p$, and the voxel mode matrix $U_v$ is combined with $\mathcal{S}$. Thus,

$$\mathcal{A} = \mathcal{D} \times_2 U_o \times_3 U_p, \qquad (9)$$

where $\mathcal{D} = \mathcal{S} \times_1 U_v$. According to (3),

$$\mathcal{A}(:, i, j) = \mathcal{D} \times_2 u_{o,i} \times_3 u_{p,j}, \qquad (10)$$

where $\mathcal{A}(:, i, j)$ is the visual hull vector corresponding to pose $j$ in orientation $i$. The vectors $u_{o,i}$ and $u_{p,j}$ are, respectively, the $i$th row of $U_o$ (i.e., the coefficient vector of orientation $i$) and the $j$th row of $U_p$ (i.e., the coefficient vector of pose $j$). Fig. 3b shows four sample columns of $\mathcal{D}$ remapped onto a cubic space.

Fig. 2. The diagram of the proposed pose feature extraction process.
Fig. 3. (a) The structure of the pose tensor. (b) Examples of reshaped columns of the core tensor D.

Given a new visual hull $\mathbf{z}$, its pose feature $\mathbf{v}_p$ and orientation feature $\mathbf{v}_o$ are found by solving the bilinear equation

$$\mathbf{z} = \mathcal{D} \times_2 \mathbf{v}_o \times_3 \mathbf{v}_p. \qquad (11)$$

Intuitively, solving for $\mathbf{v}_p$ and $\mathbf{v}_o$ from $\mathbf{z}$ can be illustrated as a rank-constrained basis projection process. Recall that the core tensor $\mathcal{D}$ is of size $N_v \times N_o \times N_p$. Intuitively, $\mathcal{D}$ can be considered to be an $N_o \times N_p$ array, with each array element being an $N_v \times 1$ vector. These array elements can be viewed as a set of basis vectors. Given the $N_v \times 1$ observation vector $\mathbf{z}$, solving for $\mathbf{v}_o$ and $\mathbf{v}_p$ is equivalent to finding a linear projection of $\mathbf{z}$ onto these basis vectors that optimizes the minimum-reconstruction-error criterion and also satisfies the rank constraint: the resulting projection coefficients must form a rank-1 $N_o \times N_p$ matrix given by $\mathbf{v}_o \mathbf{v}_p^T$. In other words, the projection coefficient for the basis vector at the $i$th row, $j$th column of $\mathcal{D}$ must equal $\mathbf{v}_o(i)\mathbf{v}_p(j)$, for all $1 \le i \le N_o$, $1 \le j \le N_p$.

4.2 Pose Vector Extraction Using ALS

The alternating least-squares (ALS) [56] algorithm is often used to iteratively solve the bilinear equation (11). Let $\mathbf{v}_o^{(n)}$ be the estimated orientation feature in the previous iteration. Then, $\mathcal{D}$ can be flattened into a matrix $C_o^{(n)} = \mathcal{D} \times_2 \mathbf{v}_o^{(n)}$. Inserting $C_o^{(n)}$ into (11) leads to

$$\mathbf{z} = C_o^{(n)} \mathbf{v}_p. \qquad (12)$$

Thus, the current pose feature estimate $\mathbf{v}_p^{(n+1)}$ can be found by solving the linear system (12). Similarly, using the current pose feature $\mathbf{v}_p^{(n+1)}$, the orientation feature $\mathbf{v}_o$ can be updated by solving a similar linear system:

$$\mathbf{z} = C_p^{(n+1)} \mathbf{v}_o, \qquad (13)$$

where $C_p^{(n+1)} = \mathcal{D} \times_3 \mathbf{v}_p^{(n+1)}$. Given the initial value $\mathbf{v}_p^{(0)}$ or $\mathbf{v}_o^{(0)}$, $\mathbf{v}_o$ and $\mathbf{v}_p$ can be iteratively updated by alternately solving (12) and (13) until convergence.

Initialization is critical for ALS. In our research, we have adopted the following initialization strategy. First, all of the row vectors $\{u_{o,i}\}_{i=1}^{N_o}$ of $U_o$ are used as initial values for $\mathbf{v}_o$. For each $u_{o,i}$, the corresponding pose vector $\mathbf{v}_{p,i}$ is obtained by solving (12) only once. From $\{\mathbf{v}_{p,i}\}_{i=1}^{N_o}$, the pose vector yielding the smallest distance to one of the standard poses $\{u_{p,i}\}_{i=1}^{N_p}$ is then chosen as $\mathbf{v}_p^{(0)}$ to initialize ALS:

$$\mathbf{v}_p^{(0)} = \arg\max_{\mathbf{v}_{p,i}} \max_j \frac{\mathbf{v}_{p,i} \cdot u_{p,j}}{\|\mathbf{v}_{p,i}\| \cdot \|u_{p,j}\|}, \qquad (14)$$

where $i = 1, \ldots, N_o$ is the orientation index and $j = 1, \ldots, N_p$ is the pose index.
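The following is a minimal NumPy sketch of the ALS solver for (11), including the initialization of (14); plain least squares is used for the linear systems (12) and (13). The variable shapes and names are assumptions for illustration, not the authors' implementation.

```python
# ALS sketch for z = D x_2 v_o x_3 v_p (Eq. (11)); D has shape (Nv, No, Np).
import numpy as np

def als_pose_feature(z, D, Uo, Up, n_iter=50, tol=1e-8):
    def solve_vp(vo):
        Co = np.einsum('vop,o->vp', D, vo)              # C_o = D x_2 v_o
        return np.linalg.lstsq(Co, z, rcond=None)[0]    # Eq. (12)

    def solve_vo(vp):
        Cp = np.einsum('vop,p->vo', D, vp)              # C_p = D x_3 v_p
        return np.linalg.lstsq(Cp, z, rcond=None)[0]    # Eq. (13)

    # Initialization (Eq. (14)): try each orientation row of Uo, keep the pose
    # estimate with the largest cosine similarity to a row of Up.
    best_score, vp = -np.inf, None
    for uo in Uo:
        cand = solve_vp(uo)
        sims = Up @ cand / (np.linalg.norm(Up, axis=1) * np.linalg.norm(cand) + 1e-12)
        if sims.max() > best_score:
            best_score, vp = sims.max(), cand

    # Alternate Eqs. (12) and (13) until the pose feature stops changing.
    vo = None
    for _ in range(n_iter):
        vo = solve_vo(vp)
        vp_new = solve_vp(vo)
        if np.linalg.norm(vp_new - vp) < tol:
            return vp_new, vo
        vp = vp_new
    return vp, vo
```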

4.3 Evaluation of View Invariance

It is important to systematically evaluate the view invariance of the proposed pose features. In our research, we have performed a series of evaluation studies using data from the IXMAS gesture recognition data set [11]. The IXMAS data set contains calibrated multiview silhouette and visual hull data from 12 subjects performing 14 daily actions. For each subject, three movement trials were included in the data set, each containing various executions of all 14 actions. To be comparable, in our experiments, we have used data from the same 10 subjects and 11 actions (Table 10) as those in [11] and [12]. The pose tensor used in these studies was formed using data from one of the 10 subjects. Details on forming this pose tensor are given in Section 6.1.

In the first study, pose features were extracted from the normalized voxel data corresponding to two poses in different orientations. Figs. 4a and 4b, respectively, show the normalized voxel data corresponding to a key pose and a nonkey pose in 16 orientations. These testing data have been selected from another subject, different from the one used for pose tensor construction. Figs. 4c and 4f show the corresponding pose features obtained using multilinear analysis. It can be seen that these pose features are invariant to body orientations for both the key pose and the nonkey pose.

In the second study, we examined the robustness of the proposed pose features in the presence of visual hull errors. In this study, protrusion errors and partial occlusion errors were added to the testing visual hull data used in the previous study (Figs. 4a and 4b). Protrusion errors correspond to the uncarved large blocks of background voxels missed during the visual hull extraction. To add a protrusion error to a visual hull, we first select a protrusion sphere with a random center in the background and a radius of three voxels (a 10th of the side length of the normalized visual hull). To realistically synthesize a protrusion error, a valid protrusion sphere is required to overlap with the visual hull in a volume less than half of the sphere. Otherwise, another random sphere will be selected until a valid protrusion sphere is found. Once such a sphere is found, all of the voxels inside the sphere are considered protrusion voxels, with their values set to 1.

Fig. 4. Voxel data of (a) a key pose and (b) a nonkey pose in 16 body orientations and their corresponding pose vectors in (c) and (f), respectively. Subplots (d) and (g) show the pose features extracted from the voxel data corrupted by protrusion errors corresponding to the key pose data in (a) and nonkey pose data in (b), respectively. Subplots (e) and (h) show the pose features extracted from the noisy data with partial occlusions corresponding to the key pose and nonkey pose data, respectively.

Partial occlusion errors correspond to the large blocks of foreground voxels wrongly carved during the visual hull extraction. To add a partial occlusion error to a visual hull, we first select an occlusion sphere with a random center in the foreground (on the subject) and a radius of three voxels. Then, all the voxels inside this occlusion sphere are treated as occluded, with their values set to 0. Fig. 5 shows examples of an original visual hull (Fig. 5a) and its two noisy versions corrupted by protrusion (Fig. 5b) and partial occlusion (Fig. 5c) errors. Pose features extracted from the noisy data are shown in Figs. 4d and 4g (extracted from the noisy data with protrusion errors) and Figs. 4e and 4h (from the noisy data with partial occlusions). It can be seen from these figures that the pose features extracted from the noisy data still largely resemble those from the original data. Hence, the view-invariance property still holds, in general, when the visual hull data are corrupted by protrusion and partial occlusion errors.
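A sketch of how the two error types described above could be synthesized on a binary voxel grid (random spheres with a three-voxel radius and the half-volume overlap check for protrusions). This is an illustrative reimplementation under these assumptions, not the authors' code.

```python
# Synthesizing protrusion and partial occlusion errors on a binary visual hull.
import numpy as np

def sphere_mask(shape, center, radius):
    grid = np.indices(shape)
    dist2 = sum((g - c) ** 2 for g, c in zip(grid, center))
    return dist2 <= radius ** 2

def add_protrusion(hull, radius=3, rng=None, max_tries=100):
    """Set background voxels inside a random sphere to 1; the sphere must overlap
    the hull by less than half of its volume to be a valid protrusion."""
    rng = rng or np.random.default_rng()
    background = np.argwhere(hull == 0)
    for _ in range(max_tries):
        center = background[rng.integers(len(background))]
        mask = sphere_mask(hull.shape, center, radius)
        if (hull[mask] > 0).sum() < 0.5 * mask.sum():
            noisy = hull.copy()
            noisy[mask] = 1
            return noisy
    return hull.copy()

def add_occlusion(hull, radius=3, rng=None):
    """Carve out (set to 0) all voxels inside a sphere centered on a foreground voxel."""
    rng = rng or np.random.default_rng()
    foreground = np.argwhere(hull > 0)
    center = foreground[rng.integers(len(foreground))]
    noisy = hull.copy()
    noisy[sphere_mask(hull.shape, center, radius)] = 0
    return noisy
```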

In the third study, we have further evaluated the view invariance of the proposed pose features in a quantitative manner. In this study, we examined the interorientation similarity of the pose features corresponding to the same pose in different orientations. For each subject, we randomly selected 100 frames of visual hull data from one of the three movement trials of the subject in the IXMAS data set. A total of 1,000 frames from 10 subjects were selected. Among these frames, 318 frames have small (< 0.6) distances (as defined in (7)) to some of the key poses. These frames are referred to as the key pose frames and the rest as the nonkey pose frames. Each visual hull frame was rotated to 16 facing directions and the corresponding pose features as well as their pair-wise similarities were obtained. The minimum value of these similarities, defined as the minimum interorientation similarity (MIOS), is used to quantify the degree of view invariance of the pose features for this visual hull frame. The 10-bin MIOS histograms for both the key pose and the nonkey pose frames are shown in Figs. 6a and 6b, respectively. As a context, Fig. 6e shows the histogram of the pair-wise interframe similarities between the 1,000 testing visual hull frames. From Fig. 6, it can be seen that the MIOS values are more than 0.9 for all the key pose frames and the majority (619 out of 682) of nonkey pose frames. Only a small percentage (6.3 percent) of the testing frames have low MIOS values. Hence, we experimentally verified the view-invariance property of the proposed pose features. The low MIOS values (i.e., discrepancies in pose features of the same pose in different orientations) are mainly due to the existence of multiple solutions when using ALS to find the pose features. In our gesture spotting experiments using the IXMAS data set, this issue did not present a significant problem.

The same study has been repeated using noisy data. To each voxel data frame used in the study, protrusion errors were added with a probability of 1/3 and partial occlusion errors with a probability of 1/3, and there is a probability of 1/3 that the data are unchanged. Using such noisy data, the MIOS histograms of the key pose and the nonkey pose frames were obtained as shown in Figs. 6c and 6d, respectively. It is clear that the voxel errors only slightly affected the view invariance of the pose features.

To examine the impact of different body shapes, we have compared pose features extracted from the same pose across different people. Fig. 7a shows the 10 visual hulls of a pose performed by the 10 IXMAS subjects. These testing data share similar views so that the impact of body shape on the pose feature can be studied independently from that of view. The corresponding pose features are given in Fig. 7b. It can be seen that although generally similar to each other, these pose features across different subjects are less consistent than those over different views as shown in Fig. 4.

We have further examined the interpeople similarity of the proposed pose features. In our research, 25 key poses (Fig. 11) were selected from the IXMAS data set for gesture spotting. In this study, for each key pose, its voxel data from the 10 IXMAS subjects were first aligned in the same body orientation. After extracting the pose features from the aligned voxel data, the average pair-wise interpeople similarities of these pose features for the given key pose were calculated. This process was repeated for all 25 key poses. Once the 25 average interpeople similarities were computed, one for each key pose, their histogram was obtained as shown in Fig. 7c. This histogram has 10 bins with a total count of 25, the number of data points. As shown in Fig. 7c, five bins sit to the right of 0.5, with a total count of 16. This implies that out of the 25 key poses, 16 have an average interpeople similarity greater than or equal to 0.5. It is clear that changes in body shapes do introduce relatively large variations. For gesture spotting, this implies that the body shape of the testing subject needs to be similar to one of the training subjects, which is certainly a limitation of the proposed framework and is to be addressed in our future work.


Fig. 5. Examples of visual hull errors. (a) The original voxel data. (b) The noisy data corrupted by a random protrusion error (red sphere). (c) The noisy data with a random partial occlusion error (green sphere).

Fig. 6. Similarity histograms of pose features. (a)-(d) Histograms of interorientation similarities for key pose data, nonkey pose data, noisy key pose data, and noisy nonkey pose data, respectively. (e) Similarity histogram of the pose features from 1,000 visual hull frames.

Fig. 7. (a) Voxel data of a pose performed by 10 subjects. (b) The corresponding pose vectors. (c) The histogram of the average interpeople similarities.


4.4 Analysis of Orientation Coefficient Vectors

When extracting the pose feature from a new observation, the corresponding orientation feature $\mathbf{v}_o$ is also available from the ALS solution. The orientation features of different poses in the same body orientation are close to each other. For example, Fig. 8b shows similar orientation features of the 25 key poses roughly aligned in the same body orientation (Fig. 8a). Using these orientation features, the body orientation angle can be estimated through manifold learning, as shown in [33].

5 GESTURE SPOTTING USING HMM

Using the proposed pose features as observations, gestures can be spotted from continuous data by using an HMM network [27], [30]. As illustrated in Fig. 9, an HMM network is formed by a number of parallel branches connecting the nonemitting starting and end states through movement HMMs, including gesture and nongesture models. The same gesture models (GM) are used for gesture spotting from continuous data and gesture recognition from presegmented data. The nongesture models contain a general garbage model (GGM) and additional HMMs representing specific nongesture movements.

5.1 Model Learning

In the proposed framework, the reduced HMM introduced in Section 2.2 has been used to create gesture models. Each emitting state is modeled as a Gaussian mixture with a diagonal covariance matrix. Assume that there are $N$ gestures in the gesture vocabulary $G$. For each gesture $g \in G$, the corresponding HMM with model parameter set $\lambda_g$ is learned using the EM algorithm from the associated training samples manually segmented from the training data. Once these gesture models are learned, they can be used to classify presegmented movement data. Let $O = \{O_1, O_2, \ldots, O_t\}$ be a sequence of pose features obtained via multilinear analysis from a gesture movement segment. This movement segment is then classified as the gesture $g^*$ yielding the maximum likelihood, i.e.,

$$g^* = \arg\max_{g \in G} p(O \mid \lambda_g). \qquad (15)$$

The number of states in these HMMs is determined by cross validation and linear search. The same parameters are also used for gesture spotting.
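A minimal sketch of the maximum-likelihood classification in (15). Standard diagonal-covariance Gaussian HMMs from hmmlearn are used here as a stand-in for the reduced-parameter HMM of [54], which off-the-shelf toolkits do not provide; state counts and dictionary keys are placeholders.

```python
# Per-gesture HMM training and the classification rule of Eq. (15).
# hmmlearn's GaussianHMM stands in for the reduced HMM used in the paper.
import numpy as np
from hmmlearn import hmm

def train_gesture_models(train_segments, n_states=8):
    """train_segments: dict mapping gesture name -> list of (T_i, D) pose-feature arrays."""
    models = {}
    for gesture, segments in train_segments.items():
        X = np.vstack(segments)
        lengths = [len(s) for s in segments]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=100)
        model.fit(X, lengths)  # EM training on the concatenated segments
        models[gesture] = model
    return models

def classify_segment(models, observations):
    """Eq. (15): return the gesture whose model gives the highest log-likelihood."""
    return max(models, key=lambda g: models[g].score(observations))
```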

5.1.1 Nongesture Models

Effectively rejecting nongesture movements is a key challenge for gesture spotting. In [27] and [30], one or two general garbage gesture models are used to model all the nongesture movement patterns. In our proposed framework, a garbage gesture model is also used to model general nongesture movement sequences. It has a single emitting state, with a flat probability distribution function over the entire observation space. The single state can loop back to itself (nongesture continues) or exit (nongesture ends). Applying this garbage model is equivalent to setting a threshold on the normalized log-likelihood of the spotted gestures, where this threshold is simply given by the logarithm of the flat probability of the emitting state of the garbage model.

As discussed earlier, using one or two general garbage models is not effective in outlier rejection in gesture spotting. To tackle this challenge, in addition to a general garbage gesture model, we have also deployed a number of specific nongesture models, including automatically identified and manually specified nongesture models. The goal is to represent specific nongesture movement patterns in the training data, and then use them to reject similar outlier patterns in gesture spotting. Some nongesture patterns are manually picked, e.g.,

• repetitive intergesture patterns (such as stand still),
• false gestures that are similar to the true ones, and
• movement patterns shared by two or more gestures.

Such patterns are common in the training data, and their HMMs are trained in the same way as the gesture HMMs.

In addition, we have developed an approach (Fig. 10a) to automatically detect and model nongesture movement patterns. First, the training movement sequences are segmented into element pieces by finding the minima of the motion energy defined in Section 3. Element pieces that overlap with training segments of the gestures and manually specified nongestures (if there are any) are eliminated, leaving only unused element pieces corresponding to the remaining nongesture movement patterns. Then, the similarity matrix of these element pieces is found using DTW [34] based on the Euclidean distance. According to this similarity matrix, the nongesture training data are grouped into a number of clusters using normalized cut [55], one cluster for each automatically detected nongesture model. The number of clusters is preset manually. Finally, the element pieces close to the cluster centers are taken as the training samples to train the corresponding nongesture HMM. In practice, the number of nongesture models can be flexibly tuned according to the specific application.
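A compact sketch of the clustering step just described, assuming each element piece is a sequence of pose-feature vectors. A plain DTW distance builds the similarity matrix, and scikit-learn's SpectralClustering on a Gaussian affinity stands in for normalized cut; these names are illustrative, not from the paper.

```python
# Automatic nongesture pattern detection: DTW distances between element pieces,
# then clustering (spectral clustering as a stand-in for normalized cut).
import numpy as np
from sklearn.cluster import SpectralClustering

def dtw_distance(seq_a, seq_b):
    """Classic DTW with Euclidean frame distance."""
    na, nb = len(seq_a), len(seq_b)
    cost = np.full((na + 1, nb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[na, nb]

def cluster_nongesture_pieces(pieces, n_clusters):
    """Return a cluster label per leftover element piece; pieces near cluster
    centers are then used to train the corresponding nongesture HMMs."""
    n = len(pieces)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dtw_distance(pieces[i], pieces[j])
    sigma = dist[dist > 0].mean() + 1e-12          # scale for the Gaussian affinity
    affinity = np.exp(-dist / sigma)
    return SpectralClustering(n_clusters=n_clusters, affinity='precomputed').fit_predict(affinity)
```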

5.2 Gesture Spotting

Using both gesture and nongesture HMMs, gesture spotting can be achieved by evaluating the joint probability of the observation sequence and the path of state transitions in an HMM network. During gesture spotting, at each time $t$, one frame of pose features $o_t$ is input to the HMM network. Let $q_t$ be the hidden state at time $t$, $S_n$ be the $n$th hidden state in the HMM network, $\mathcal{S}$ be the collection of all the states of the HMM network, and $O_t$ be the observation sequence from the beginning of the movement piece up to time $t$. Let

$$\delta_t(S_n) = \max_{q_1, \ldots, q_{t-1}} p(O_t, q_1, \ldots, q_{t-1}, q_t = S_n \mid \lambda) \qquad (16)$$

be the joint probability of $O_t$ and the optimal state path to the current state $q_t = S_n$. In (16), $\lambda$ is the parameter set of the entire HMM network. Using the Viterbi algorithm, $\delta_t(S_n)$ can be computed based on $\delta_{t-1}(S)$, $\forall S \in \mathcal{S}$, and the current observation in an incremental manner.

Fig. 8. Voxel data of 25 poses performed in (a) the same orientation and (b) their orientation vectors.
Fig. 9. The HMM network used in gesture spotting.

Using the reduced HMM, each gesture model contains a nonemitting end state. Reaching this end state indicates the execution of the corresponding gesture. Let $E_h$ be the end state of HMM $h$. Let $G$ be the gesture vocabulary and $F$ be the nongesture set. At time instant $t$, if the end probability of a gesture $g^*$ is the largest among all the gestures and nongestures, $g^*$ is spotted, i.e.,

$$g^* = \arg\max_{h \in G \cup F} \delta_t(E_h) \quad \text{and} \quad g^* \in G. \qquad (17)$$

Once a gesture is detected, its starting time can be easily backtracked along the most probable path.

This preliminary spotting result is further refined. A length constraint and a likelihood constraint are set up to reject outliers. In our experiments, we require the length of a spotted gesture to be shorter than 50 frames (since the length of ground-truth gesture segments is in the range of 10 to 35 frames), and the likelihood of the spotted gesture segment to be larger than $10^{-80}$ (the majority of the training gesture likelihoods are in the range of $10^{-20}$ to $10^{-50}$). Movement segments satisfying both constraints are admitted as gesture segment candidates. Then, temporal consistency is further used to stabilize the gesture spotting results. To be specific, a final spotting decision is made only when gesture segment candidates sharing the same starting frame are continually detected $T$ (a prechosen threshold) times without other candidates detected in between. This gesture spotting scheme is summarized in Fig. 10b. The spotting result is a series of spotted gesture segments marked by their beginning and end frame numbers together with their gesture labels and likelihood values.
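A sketch of the refinement logic above: the length and likelihood constraints followed by the temporal-consistency check on candidates that share the same starting frame. Class, field, and threshold names are illustrative assumptions, not the authors' implementation.

```python
# Post-processing of preliminary spotting results (Section 5.2): length and
# likelihood constraints plus the temporal-consistency confirmation.
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str
    start: int          # backtracked beginning frame
    end: int            # current frame
    likelihood: float

class SpottingRefiner:
    def __init__(self, max_len=50, min_likelihood=1e-80, consistency_T=5):
        self.max_len = max_len
        self.min_likelihood = min_likelihood
        self.T = consistency_T        # consecutive confirmations required
        self._last_key = None         # (label, start) of the candidate being tracked
        self._count = 0

    def admit(self, c: Candidate) -> bool:
        """Length and likelihood constraints on a candidate segment."""
        return (c.end - c.start + 1) < self.max_len and c.likelihood > self.min_likelihood

    def update(self, c: Candidate):
        """Feed one admitted candidate per frame; return it once confirmed T times in a row."""
        if not self.admit(c):
            self._last_key, self._count = None, 0
            return None
        key = (c.label, c.start)
        self._count = self._count + 1 if key == self._last_key else 1
        self._last_key = key
        return c if self._count >= self.T else None
```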

6 EXPERIMENTS AND ANALYSIS

Our proposed gesture spotting framework has been tested using the IXMAS gesture data set [11] and our results are superior to those reported in [11], [12], [13] on the same data set. To be comparable, in our experiments, we only used data from the same 10 subjects and 11 actions (Table 10) as those in [11] and [12].

6.1 Key Pose Selection and Pose Tensor Formation

Using the proposed key pose selection method, 25 key poses, shown in Fig. 11 in their most distinguishable views, were selected from a movement trial of Florian (one of the 10 subjects). The visual hull data of these key frames from Florian's data were then used to form the pose tensor. They were first normalized to a size of $30 \times 30 \times 30$, and then manually aligned to approximately share the same facing direction. To obtain the training pose tensor, each aligned visual hull was further rotated about its vertical body axis to generate voxel data facing 16 directions, evenly distributed from 0 to $\frac{15}{8}\pi$ with $\frac{\pi}{8}$ between adjacent views. Their mean was also subtracted to center the voxel data. The resulting centered voxel data from all key poses were then arranged into the three-mode $30^3 \times 16 \times 25$ training pose tensor. This pose tensor was further decomposed to obtain the core tensor using HOSVD. Given the core tensor, the pose feature of an input visual hull frame can be found using ALS.

6.2 Training and Testing Schemes

Following [11] and [12], we have evaluated our proposed approach through cross validation. In each training and testing cycle, the movement data and associated pose features of nine of the 10 subjects in the IXMAS data set were used as the training data and those of the remaining subject were then used for testing. This procedure was repeated 10 times so that each subject was used once as the testing subject. The final results reported in Sections 6.3 and 6.4 were based on the cumulative results obtained in all 10 training-testing cycles.

6.2.1 Movement Signature and Ground-Truth Data

Following [11] and [12], in our research, an elemental movement was selected for each action as the corresponding representative signature. For example, for the "check watch" action, the "raise hand" motion was selected as the movement signature. The 11 action signatures form the gesture vocabulary in our experiments. All of the movement segments corresponding to the action signatures were manually identified from the IXMAS data set and used as training and ground-truth data. They are referred to as the gesture segments in this section.

Fig. 10. (a) Automatic detection and HMM training of nongesture movement patterns. (b) The flowchart of the gesture spotting algorithm.
Fig. 11. The 25 key poses selected from the IXMAS data set.

6.2.2 Training

In each training-testing cycle, on average about 283 gesture segments were used to train the 11 gesture HMMs. To be consistent with [11], for gestures executed multiple times in a training movement trial, only one of them was (randomly) selected to be included in the training set.

For gesture spotting, nongesture HMMs also need to be trained. The training data of the manually selected nongestures were hand picked, and those of the automatically detected nongestures were obtained during nongesture movement detection, as discussed in Section 5.1.1.

6.2.3 Testing

In each training-testing cycle, once the 11 gesture HMMs were learned, the proposed framework was tested first using presegmented data. The testing data were from the gesture segments of the testing subject. To be consistent with [11], for gestures executed multiple times in a single movement trial, only one of them was used in testing. On average, about 31 testing movement segments were used in a testing cycle. In each testing cycle for gesture spotting, all three complete movement trials of the testing subject were used as testing data.

6.3 Gesture Recognition Using Presegmented Data

To evaluate gesture recognition using presegmented testing data, the recognition and false alarm rates for both the individual gestures and the entire gesture vocabulary were obtained according to Table 1 and are given in Table 2. The corresponding confusion matrix is shown in Table 3. Table 2 also includes the gesture recognition results on the same data set reported in [11] and [12]. It can be seen that our results using the proposed pose features are slightly better than the existing results.

TABLE 1. Notations for Evaluation Using Presegmented Data
TABLE 2. Gesture Recognition Results Using Presegmented Data
TABLE 3. Confusion Matrix (in Percent)

A valid question to ask is whether it is necessary to extract pose features for gesture recognition. Since the 25 key poses have been identified, a possible brute-force gesture recognition approach is to first match an input frame to one of the key poses to obtain a discrete-observation model, and then use HMMs for gesture recognition. In our research, we have implemented this simple brute-force discrete-observation method and compared it with our method using pose features. In this discrete method, an input frame is assigned the ID of its closest key pose neighbor according to the interpose distance measure (7), using 16 search angles on a uniform grid over $(0, 2\pi]$. Then, reduced HMMs using the pose IDs as discrete observations are used for gesture modeling and recognition. The results are in Table 2. It is clear that such a simple discrete method performs much worse than the proposed method. Hence, it is valid to extract the proposed view-invariant pose features for reliable gesture recognition.

6.4 Gesture Spotting Using Continuous Data

6.4.1 Evaluation Criteria

To evaluate the proposed gesture spotting framework, we have analyzed and compared our spotting results against the ground-truth data in a number of aspects, including the temporal matching accuracy, the recognition and false alarm rates, and the reliability of recognition.

Let $F_b(i)$ and $F_e(i)$ be the beginning and end frame numbers of the $i$th true gesture segment in the testing data. The length of this segment is $L_{GT}(i) = F_e(i) - F_b(i) + 1$. Let $S_b(i)$ and $S_e(i)$ be the beginning and end frame numbers of a spotted gesture segment. Define the absolute temporal matching score $O_A(i)$ as the number of overlapped frames between the spotted and ground-truth gesture segments:

$$O_A(i) = \min\{S_e(i), F_e(i)\} - \max\{S_b(i), F_b(i)\}, \qquad (18)$$

and the relative temporal matching score $O_R(i)$ as the ratio of $O_A(i)$ to the length of the true segment:

$$O_R(i) = O_A(i)/L_{GT}(i). \qquad (19)$$

When $O_R(i)$ is larger than a prechosen threshold $\beta$ ($0 < \beta \le 1$), the spotted gesture is considered temporally matched to the ground-truth segment. In our experiments, the default value of $\beta$ was 0.5.

Once a spotted gesture segment is temporally matched to a ground-truth segment, their gesture labels are compared. If they share the same gesture label, a correct spotting occurs. Otherwise, a substitution error occurs. If a spotted gesture segment is not temporally matched to any of the ground-truth segments with respect to the prechosen $\beta$, an insertion error occurs. On the other hand, if a ground-truth gesture segment is not matched to any spotted gesture segment, a missing error occurs. Relevant notations are given in Table 4. When a true gesture segment is not correctly spotted, there are two possibilities: it is either not detected (missing) or detected but misrecognized (substitution). Therefore, the number of unspotted gesture segments is $E_T = N_T - N_C = E_S + E_M$. Table 5 lists the indicators used in our analysis to measure the spotting and temporal matching accuracies.

TABLE 4. Notations for Gesture Spotting Evaluation
TABLE 5. Performance Indicators for Gesture Spotting (Summations Are over the Correctly Spotted Gesture Segments)
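A small sketch of the matching scores in (18) and (19) and the threshold test, assuming inclusive integer frame numbers; beta denotes the temporal matching threshold.

```python
# Temporal matching scores of Eqs. (18)-(19) and the matching rule.
def temporal_match(spotted, truth, beta=0.5):
    """spotted, truth: (begin_frame, end_frame) pairs; returns (O_A, O_R, matched)."""
    s_b, s_e = spotted
    f_b, f_e = truth
    length_gt = f_e - f_b + 1                       # L_GT(i)
    o_a = min(s_e, f_e) - max(s_b, f_b)             # Eq. (18), as defined in the paper
    o_r = o_a / length_gt                           # Eq. (19)
    return o_a, o_r, o_r > beta                     # matched when O_R exceeds the threshold

# Example: a spotted segment [12, 30] against ground truth [10, 28].
print(temporal_match((12, 30), (10, 28)))           # (16, 0.842..., True)
```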

6.4.2 Experimental Results

To examine the impact of using specific nongesture models on gesture spotting, we have experimented with various HMM combinations. We first started with the gesture models and the general garbage model. Then, the automatically detected nongesture models (ANGM) and the manually selected nongesture models (MNGM) were gradually added.

The gesture spotting accuracy and temporal matching accuracy (when $\beta = 0.5$) using various models are given in Tables 6 and 7, respectively. From Table 6, it can be seen that using more nongesture models greatly reduced the insertion errors without significantly diminishing correct recognition. Consequently, the reliability of the spotted gestures greatly increased. It can also be seen from Table 6 that when more ANGMs were used, adding MNGMs only slightly improved the spotting accuracy. From Table 7, we can see that different gesture model combinations had only a very slight impact on the temporal matching accuracy, and the resulting temporal matching accuracy measures are all at reasonable levels.

To examine the influence of the temporal matching threshold $\beta$ on gesture spotting, results using different values of $\beta$ have been obtained, as shown in Table 8. The corresponding HMM combinations are MNGM+15ANGM+GM+GGM and 15ANGM+GM+GGM. It is clear from Table 8 that when $\beta$ decreased, both the recognition rate and the reliability increased, and meanwhile the insertion errors and the false alarm rate reduced. This is because when $\beta$ is low, more spotted gesture segments are matched to true gesture segments. Moreover, Table 8 also indicates that $\beta$ affected the distributions of the substitution and missing errors. Recall that $E_T = E_S + E_M$. It can be seen from Table 8 that when $\beta$ was decreasing, both $E_T$ and $E_M$ were decreasing while $E_S$ was increasing. This is because reducing $\beta$ allows more spotted gestures to be temporally matched to true gesture segments (thus reducing $E_M$). Meanwhile, not all of the newly admitted segments have the same gesture labels as the true gesture segments, thus leading to increased $E_S$.

To examine how the proposed method spots gestures from multiple testing subjects, we ran tests using IXMAS subjects 6 to 10 for testing and the rest for training. Using all the nongesture models with ρ = 0.5, the resulting RR is 74.54 percent and the FAR is 10.38 percent, which are comparable to those obtained using one testing subject (RR: 80.14 percent, FAR: 10.16 percent; row 6 of Table 9).

6.4.3 Comparison with Existing Methods

Gesture spotting results using continuous streams from the IXMAS data set have been reported in [11] and [13]. Table 9 shows the comparison between our results and those in [11]. It can be seen that our method, across different values of ρ, consistently achieved higher recognition rates and lower false positive rates than those in [11]. Moreover, the way we evaluate gesture spotting accuracy is much stricter and more complete than that used in [11]. Unlike our method, where gestures are spotted directly from continuous movement data, in [11] Weinland et al. first segmented the test movement trials using motion energy and then classified the resulting segments as either gestures or nongestures. To compute the recognition and false alarm rates, ground truth was obtained manually on top of the segmented data. Consequently, the spotting results in [11] do not take segmentation errors into account. For example, if the segmentation algorithm wrongly merges two gestures, the combined segment is treated as a nongesture segment in the ground truth, and this segmentation error is not reflected in the spotting results. In practice, segmentation errors also lead to gesture spotting errors. It is therefore suboptimal to derive ground truth purely from the segmented data and to omit segmentation errors when computing the recognition rate. The true gesture recognition rate (also used in our research) is the number of correctly spotted gestures (N_C) divided by the number of true gestures (N_T). When the errors introduced by wrong segmentation are ignored, the resulting recognition rate is instead N_C divided by N_S, the number of correctly segmented gestures. Since N_S is always less than or equal to N_T, the recognition rate that ignores segmentation errors is always greater than or equal to the actual recognition rate. Even under this stricter method for recognition rate computation, Table 9 shows that our proposed method consistently outperformed the method in [11].


TABLE 6: Gesture Recognition Accuracy (ρ = 0.5)

TABLE 7: Temporal Matching Accuracy of Correctly Recognized Gestures (ρ = 0.5)

TABLE 8: Gesture Recognition Accuracy with Various ρ Values and Two HMM Network Models: 15ANGM+GM+GGM / MNGM+15ANGM+GM+GGM

TABLE 9: Comparison of Gesture Recognition Accuracy


To demonstrate the advantage of using the reduced HMM in gesture spotting, we also obtained results using a left-to-right chain HMM without state skips. As shown in Table 9, the results obtained with the reduced model are significantly better than those of the chain model, especially in terms of the false alarm rates: the recognition rates of the reduced model are slightly higher, while its false alarm rates are only half of those of the chain model. Therefore, using the reduced HMM does improve gesture spotting compared to the traditional left-to-right HMM.
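
For reference, the left-to-right chain baseline without state skips constrains each state to either self-loop or advance to its immediate successor. A minimal sketch of constructing such a transition matrix is given below; the number of states and the self-loop probability are placeholders, and the reduced-parameter HMM of [54] is not reproduced here.

import numpy as np

def left_to_right_chain(n_states, p_stay=0.6):
    """Transition matrix of a left-to-right chain HMM with no state skips."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_stay             # self-loop
        A[i, i + 1] = 1.0 - p_stay   # advance to the next state only
    A[-1, -1] = 1.0                  # final state is absorbing
    return A

Every row of the resulting matrix sums to one and contains at most two nonzero entries, which is the structure of the chain-model baseline compared in Table 9.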

A monocular, template-based gesture spotting system is introduced in [13], where the percentage of correctly labeled frames is used to measure the gesture spotting accuracy. In [13], the global optimal path of the HMM states obtained at the end of the data stream is used to derive the frame labels. We have also obtained such a global optimal path from our HMM implementation and computed the frame-wise recognition rates shown in Table 10. It can be seen that our overall result is better than that in [13]. Table 10 also includes frame-wise online spotting results from our method. It is clear that using the global optimal path increases the frame-wise spotting accuracy. The gesture set in [13] includes all 14 gestures in the IXMAS data set, plus a new "stand still" gesture. To be consistent with [11] and [12], our experiments were done using 11 gestures, and the comparison with [13] is also based on these gestures.

TABLE 10: Comparison of Per-Frame Accuracy of Gesture Spotting
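
The frame-wise accuracy reported in Table 10 is simply the fraction of frames whose predicted gesture label agrees with the ground-truth label. A small sketch with hypothetical per-frame labels is shown below; the same routine applies whether the labels come from the online decisions or from the global optimal state path decoded at the end of the stream.

def frame_wise_recognition_rate(predicted_labels, true_labels):
    """Percentage of frames whose predicted gesture label matches the ground truth."""
    assert len(predicted_labels) == len(true_labels)
    n_correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * n_correct / len(true_labels)

# Hypothetical example ('NG' denotes nongesture frames):
online = ['NG', 'NG', 'wave', 'wave', 'wave', 'NG']
truth  = ['NG', 'wave', 'wave', 'wave', 'NG', 'NG']
print(frame_wise_recognition_rate(online, truth))  # 4 of 6 frames correct, about 66.7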

In addition to the improved results, our proposed system has other advantages over that in [13]. A major weakness of the method in [13] is that the matching templates for gesture spotting depend on the specific camera tilt angle: an input image is matched to synthetic image templates pregenerated for the given tilt angle, and such image templates must be generated again when a new tilt angle is adopted. This constraint makes the system setup more time and resource demanding. Although it is possible to store matching templates for a set of tilt angles, it is unclear how well the method in [13] performs when the actual camera tilt angle differs from its closest neighbor among the prestored angles. In contrast, the core tensor in our method does not depend on the camera setup and can be used to extract pose features from any voxel data. In addition, our approach requires less training data and equipment than that in [13], which needs motion capture data and animation software to synthesize the matching templates; motion capture becomes necessary when the required motion data are not available in a public database. In contrast, other than the visual hull data, our method does not require any motion capture data or system, or any animation software.

6.5 Computational Complexity

The proposed gesture spotting approach is implemented using Matlab. On average, it takes 2.4 seconds to process one frame of visual hull data, and pose feature extraction using ALS is the most time-consuming step. An optimized C++ implementation can greatly reduce the running time. In fact, we have implemented a near-real-time, image-based gesture spotting system [26], running at up to 15 fps on a standard PC (3.6 GHz dual-core Intel Xeon CPU, 3.25 GB RAM, Windows XP Professional).

7 CONCLUSIONS AND FUTURE WORK

In this paper, we present a framework for spotting gestures from visual hull data. View-invariant pose features are extracted using multilinear analysis and used as input to HMM-based gesture spotting. As shown by the experimental results, the proposed pose features exhibit satisfying view-invariance properties, and using specific nongesture models improves gesture spotting in terms of lower false alarm rates and higher recognition reliability, without sacrificing much of the recognition rate.

In our future work, we will improve the proposed pose feature to achieve body-shape invariance. We will also investigate more complete solutions to pose feature extraction via multipoint-initialized ALS.

ACKNOWLEDGMENTS

This work was supported in part by US National Science Foundation grants RI-04-03428 and DGE-05-04647. The authors are thankful to the referees for their insightful comments and to Stjepan Rajko for developing and releasing the reduced HMM code to the public and proofreading an early version of the paper.

REFERENCES

[1] C. Cruz-Neira, D.J. Sandin, T.A. DeFanti, R.V. Kenyon, and J.C. Hart, "The Cave: Audio Visual Experience Automatic Virtual Environment," Comm. ACM, vol. 35, no. 6, pp. 64-72, 1992.

[2] T. Starner, B. Leibe, D. Minnen, T. Westyn, A. Hurst, and J. Weeks, "The Perceptive Workbench: Computer-Vision-Based Gesture Tracking, Object Tracking, and 3D Reconstruction of Augmented Desks," Machine Vision and Applications, vol. 14, pp. 59-71, 2003.

[3] C. Keskin, K. Balci, O. Aran, B. Sankur, and L. Akarun, "A Multimodal 3D Healthcare Communication System," Proc. 3DTV Conf., pp. 1-4, 2007.

[4] A. Camurri, B. Mazzarino, G. Volpe, P. Morasso, F. Priano, and C. Re, "Application of Multimedia Techniques in the Physical Rehabilitation of Parkinson's Patients," J. Visualization and Computer Animation, vol. 14, pp. 269-278, 2003.

[5] H.S. Park, D.J. Jung, and H.J. Kim, "Vision-Based Game Interface Using Human Gesture," Advances in Image and Video Technology, pp. 662-671, Springer, 2006.

[6] S.-W. Lee, "Automatic Gesture Recognition for Intelligent Human-Robot Interaction," Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 645-650, 2006.


[7] A. Camurri, S. Hashimoto, M. Ricchetti, A. Ricci, K. Suzuki, R. Trocca, and G. Volpe, "EyesWeb: Toward Gesture and Affect Recognition in Interactive Dance and Music Systems," Computer Music J., vol. 24, no. 1, pp. 57-69, 2000.

[8] G. Qian, F. Guo, T. Ingalls, L. Olson, J. James, and T. Rikakis, "A Gesture-Driven Multimodal Interactive Dance System," Proc. IEEE Int'l Conf. Multimedia and Expo, pp. 1579-1582, 2004.

[9] Y. Zhu and G. Xu, "A Real-Time Approach to the Spotting, Representation, and Recognition of Hand Gestures for Human-Computer Interaction," Computer Vision and Image Understanding, vol. 85, pp. 189-208, 2002.

[10] H.-D. Yang, S. Sclaroff, and S.-W. Lee, "Sign Language Spotting with a Threshold Model Based on Conditional Random Fields," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 7, pp. 1264-1277, July 2009.

[11] D. Weinland, R. Ronfard, and E. Boyer, "Free Viewpoint Action Recognition Using Motion History Volumes," Computer Vision and Image Understanding, vol. 104, nos. 2/3, pp. 249-257, 2006.

[12] D. Weinland, E. Boyer, and R. Ronfard, "Action Recognition from Arbitrary Views Using 3D Exemplars," Proc. IEEE Int'l Conf. Computer Vision, pp. 1-7, 2007.

[13] F. Lv and R. Nevatia, "Single View Human Action Recognition Using Key Pose Matching and Viterbi Path Searching," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.

[14] H. Francke, J.R. del Solar, and R. Verschae, "Real-Time Hand Gesture Detection and Recognition Using Boosted Classifiers and Active Learning," Advances in Image and Video Technology, pp. 533-547, Springer, 2007.

[15] G. Ye, J.J. Corso, D. Burschka, and G.D. Hager, "VICS: A Modular HCI Framework Using Spatiotemporal Dynamics," Machine Vision and Applications, vol. 16, no. 1, pp. 13-20, 2004.

[16] G. Ye, J.J. Corso, and G.D. Hager, "Gesture Recognition Using 3D Appearance and Motion Features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 160-166, 2004.

[17] M. Holte and T. Moeslund, "View Invariant Gesture Recognition Using 3D Motion Primitives," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 797-800, 2008.

[18] T. Kirishima, K. Sato, and K. Chihara, "Real-Time Gesture Recognition by Learning and Selective Control of Visual Interest Points," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 351-364, Mar. 2005.

[19] S. Mitra and T. Acharya, "Gesture Recognition: A Survey," IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Rev., vol. 37, no. 3, pp. 311-324, May 2007.

[20] A. Bobick and Y. Ivanov, "Action Recognition Using Probabilistic Parsing," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 196-202, 1998.

[21] A. Yilmaz, "Recognizing Human Actions in Videos Acquired by Uncalibrated Moving Cameras," Proc. IEEE Int'l Conf. Computer Vision, pp. 150-157, 2005.

[22] Y. Shen and H. Foroosh, "View-Invariant Action Recognition from Point Triplets," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1898-1905, Oct. 2009.

[23] V. Parameswaran and R. Chellappa, "View Invariance for Human Action Recognition," Int'l J. Computer Vision, vol. 66, no. 1, pp. 83-101, 2006.

[24] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as Space-Time Shapes," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247-2253, Dec. 2007.

[25] A.F. Bobick and J.W. Davis, "The Recognition of Human Movement Using Temporal Templates," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, Mar. 2001.

[26] B. Peng, G. Qian, and S. Rajko, "View-Invariant Full-Body Gesture Recognition from Video," Proc. Int'l Conf. Pattern Recognition, pp. 1-5, 2008.

[27] S. Eickeler, A. Kosmala, and G. Rigoll, "Hidden Markov Model Based Continuous Online Gesture Recognition," Proc. Int'l Conf. Pattern Recognition, pp. 1206-1208, 1998.

[28] J. Alon, V. Athitsos, Q. Yuan, and S. Sclaroff, "A Unified Framework for Gesture Recognition and Spatiotemporal Gesture Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1685-1699, Sept. 2009.

[29] T. Starner, J. Weaver, and A. Pentland, "Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1371-1375, Dec. 1998.

[30] H.-K. Lee and J. Kim, "An HMM-Based Threshold Model Approach for Gesture Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 10, pp. 961-973, Oct. 1999.

[31] H.-D. Yang, A.-Y. Park, and S.-W. Lee, "Gesture Spotting and Recognition for Human-Robot Interaction," IEEE Trans. Robotics, vol. 23, no. 2, pp. 256-270, Apr. 2007.

[32] C. Chu and I. Cohen, "Pose and Gesture Recognition Using 3D Body Shapes Decomposition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 69-78, 2005.

[33] B. Peng and G. Qian, "Binocular Full-Body Pose Recognition and Orientation Inference Using Multilinear Analysis," Tensors in Image Processing and Computer Vision, S. Aja-Fernandez, R. de Luis García, D. Tao, and X. Li, eds., Springer, 2009.

[34] H. Sakoe, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-26, no. 1, pp. 43-49, Feb. 1978.

[35] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989.

[36] A. McCallum, D. Freitag, and F. Pereira, "Maximum Entropy Markov Models for Information Extraction and Segmentation," Proc. Int'l Conf. Machine Learning, pp. 591-598, 2000.

[37] J.D. Lafferty, A. McCallum, and F.C.N. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proc. Int'l Conf. Machine Learning, pp. 282-289, 2001.

[38] C. Myers, L. Rabiner, and A. Rosenberg, "Performance Tradeoffs in Dynamic Time Warping Algorithms for Isolated Word Recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, no. 6, pp. 623-635, Dec. 1980.

[39] A. Pikrakis, S. Theodoridis, and D. Kamarotos, "Recognition of Isolated Musical Patterns Using Context Dependent Dynamic Time Warping," IEEE Trans. Speech and Audio Processing, vol. 11, no. 3, pp. 175-183, May 2003.

[40] J. Lichtenauer, E. Hendriks, and M. Reinders, "Sign Language Recognition by Combining Statistical DTW and Independent Classification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 2040-2046, Nov. 2008.

[41] T.G. Dietterich, "Machine Learning for Sequential Data: A Review," Proc. Joint IAPR Int'l Workshop Structural, Syntactic, and Statistical Pattern Recognition, pp. 15-30, 2002.

[42] H.-D. Yang, A.-Y. Park, and S.-W. Lee, "Robust Spotting of Key Gestures from Whole Body Motion Sequence," Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 231-236, 2006.

[43] S. Rajko, G. Qian, T. Ingalls, and J. James, "Real-Time Gesture Recognition with Minimal Training Requirements and Online Learning," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.

[44] K. Nickel and R. Stiefelhagen, "Visual Recognition of Pointing Gestures for Human-Robot Interaction," Image and Vision Computing, vol. 25, no. 12, pp. 1875-1884, 2007.

[45] B. Peng, G. Qian, and S. Rajko, "View-Invariant Full-Body Gesture Recognition via Multilinear Analysis of Voxel Data," Proc. Int'l Conf. Distributed Smart Cameras, 2009.

[46] L.D. Lathauwer, B.D. Moor, and J. Vandewalle, "A Multilinear Singular Value Decomposition," SIAM J. Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253-1278, 2000.

[47] L. Elden, Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007.

[48] M.A.O. Vasilescu and D. Terzopoulos, "Multilinear Analysis of Image Ensembles: TensorFaces," Proc. European Conf. Computer Vision, pp. 447-460, 2002.

[49] D. Vlasic, M. Brand, H. Pfister, and J. Popović, "Face Transfer with Multilinear Models," Proc. ACM SIGGRAPH, pp. 426-433, 2005.

[50] M.A.O. Vasilescu and D. Terzopoulos, "TensorTextures: Multilinear Image-Based Rendering," ACM Trans. Graphics, vol. 23, no. 3, pp. 334-340, 2004.

[51] M.A.O. Vasilescu, "Human Motion Signatures: Analysis, Synthesis, Recognition," Proc. Int'l Conf. Pattern Recognition, pp. 456-460, 2002.

[52] J. Davis and H. Gao, "An Expressive Three-Mode Principal Components Model of Human Action Style," Image and Vision Computing, vol. 21, no. 11, pp. 1001-1016, 2003.

[53] C.-S. Lee and A. Elgammal, "Modeling View and Posture Manifolds for Tracking," Proc. IEEE Int'l Conf. Computer Vision, pp. 1-8, 2007.


[54] S. Rajko and G. Qian, "HMM Parameter Reduction for Practical Gesture Recognition," Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 1-6, 2008.

[55] J. Shi, S. Belongie, T. Leung, and J. Malik, "Image and Video Segmentation: The Normalized Cut Framework," Proc. IEEE Int'l Conf. Image Processing, pp. 943-947, 1998.

[56] H.A.L. Kiers, "An Alternating Least Squares Algorithm for PARAFAC2 and Three-Way DEDICOM," Computational Statistics and Data Analysis, vol. 16, no. 1, pp. 103-118, 1993.

Bo Peng received the BS degree in electrical engineering from Zhejiang University, Hangzhou, China, in 2006. He is currently working toward the PhD degree in electrical engineering at Arizona State University, Tempe. He was a member of Chu Kochen Honors College, Zhejiang University, from 2002 to 2006. His research interests include human motion analysis, computer vision, and machine learning.

Gang Qian received the BE (Distinction) degree from the University of Science and Technology of China (USTC), Hefei, China, in 1995. He received the MS and PhD degrees in electrical engineering from the University of Maryland, College Park, in 1999 and 2002, respectively. He is an assistant professor in the School of Arts, Media and Engineering, and the School of Electrical, Computer and Energy Engineering at Arizona State University, Tempe. He was a faculty research assistant (2001-2002) and a research associate (2002-2003) at the Center for Automation Research (CfAR) at the University of Maryland Institute for Advanced Computer Studies. He has served on the organizing/technical committees of a number of international conferences, including the 2008 and 2009 International Conference on Image Processing and the 2006 International Conference on Image and Video Retrieval. His current research includes computer vision and pattern analysis, sensor fusion and information integration, multimodal sensing and analysis of human movement and activities, human-computer interaction and human-centered interactive systems, and machine learning for computer vision. He is a member of the IEEE and the IEEE Computer Society.



