Online Gesture Spotting from Visual Hull Data

Bo Peng and Gang Qian, Member, IEEE
Abstract—This paper presents a robust framework for online full-body gesture spotting from visual hull data. Using view-invariant
pose features as observations, hidden Markov models (HMMs) are trained for gesture spotting from continuous movement data
streams. Two major contributions of this paper are 1) view-invariant pose feature extraction from visual hulls, and 2) a systematic
approach to automatically detecting and modeling specific nongesture movement patterns and using their HMMs for outlier rejection in
gesture spotting. The experimental results have shown the view-invariance property of the proposed pose features for both training
poses and new poses unseen in training, as well as the efficacy of using specific nongesture models for outlier rejection. Using the
IXMAS gesture data set, the proposed framework has been extensively tested and the gesture spotting results are superior to those
reported on the same data set obtained using existing state-of-the-art gesture spotting methods.
Index Terms—Online gesture spotting, view invariance, multilinear analysis, visual hull, hidden Markov models, nongesture models.
1 INTRODUCTION
HUMAN gesture recognition has received considerable attention in the past decade. Enabling machines to understand human gestures is critical for developing embodied human-computer interaction (HCI) systems that allow users to communicate with computers through actions and gestures in a much more intuitive and natural manner than traditional interfaces based on mouse clicks and keystrokes. Such systems have important applications in virtual reality [1], industrial control [2], healthcare [3], [4], computer games [5], human-robot interaction [6], and interactive dance performance [7], [8].
A gesture recognition system is preferably nonintrusive. Using body-worn sensors such as markers and inertial sensors in gesture recognition is cumbersome and sometimes movement-restraining. For this reason, video sensing has been widely applied in gesture recognition. In practice, gestures need to be simultaneously detected and recognized from continuous movement data. This task is commonly referred to as gesture spotting [9], [10]. Furthermore, online gesture spotting is often desired for real-time processing, in which the recognition decision is made using data up to the current observation, without having to wait for any future data.
In this paper, we present an online, video-based framework for view-invariant, full-body gesture spotting. After extracting view-invariant pose features using multilinear analysis from visual hull data, hidden Markov models (HMMs) are trained for gesture recognition by using these pose features as observations. The proposed method has been extensively tested on the IXMAS gesture data set [11], and our results are superior to those reported on the same data set in [11], [12], [13].
The outline of this paper is as follows: The rest of this section reviews state-of-the-art video-based gesture recognition and discusses relevant outstanding challenges. In Section 2, we briefly introduce multilinear analysis and the reduced-parameter HMM. In Section 3, key pose selection is discussed. In Section 4, we present the proposed pose feature extraction method as well as results on view-invariance evaluation. Section 5 introduces the model learning and gesture spotting strategies and our proposed method for autonomous nongesture movement pattern detection and modeling. Experimental results and performance analysis are provided in Section 6. Finally, in Section 7, we conclude the paper and present future research plans.
1.1 Video-Based Gesture Recognition
Many video-based methods have been developed for hand [14], [15], [16], arm [17], [18], and full-body [6], [11] gesture recognition. See [19] for a recent literature survey. These methods can be roughly classified into the kinematic-based [5], [6], [15], [20], [21], [22], [23] and the template-based approaches [11], [12], [13], [18], [24], [25], [26]. The kinematic-based approaches use articulated motion parameters such as joint angle vectors [6], body-centered joint locations [21], or body part positions [5], [15], [20] as features for gesture recognition. The major weakness of such approaches is that reliable articulated motion tracking from video is challenging and kinematic recovery is subject to tracking failures.
The template-based approaches represent gestures using features extracted directly from image observations, and these approaches can be further split into the holistic and the sequential approaches. The holistic approaches, e.g., [11], [24], [25], represent an entire gesture as a spatio-temporal shape from which features are extracted for gesture recognition. In contrast, the sequential approaches, e.g., [12], [13], [26], represent a gesture as a temporal series of features, one for each time instant. Compared to the holistic approaches, the sequential approaches are more powerful in capturing and modeling variations in gesture dynamics. When spotting gestures from continuous data, the sequential approaches simultaneously detect the gesture boundaries and evaluate gesture likelihoods for every incoming data frame. This is very challenging for the holistic approaches without extra delay. For these reasons, in this paper, we focus on template-based, sequential gesture spotting and address the two pressing challenges: reliable view-invariant feature extraction and accurate online spotting.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 33, NO. 6, JUNE 2011

. The authors are with the School of Arts, Media and Engineering, 699 S. Mill Ave., Suite 395, PO Box 878709, Tempe, AZ 85281. E-mail: [email protected], [email protected].

Manuscript received 13 Dec. 2009; revised 17 June 2010; accepted 22 Oct. 2010; published online 9 Nov. 2010. Recommended for acceptance by S. Sclaroff. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPAMI-2009-12-0817. Digital Object Identifier no. 10.1109/TPAMI.2010.199.

0162-8828/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.
1.2 Challenges in Gesture Spotting
1.2.1 View-Invariant Recognition
Many HCI systems require view invariance so that gestures can be spotted independent of the body orientation of the subject. In our research, we focus on the body orientation angle about the vertical body axis, perpendicular to the ground plane and through the body centroid. Many monocular gesture spotting approaches are view dependent [9], [10], [27], [28], [29], [30], [31], i.e., they assume known body orientation angles. Although valid in some scenarios, such as automatic sign language interpretation, having to know the body orientation presents an undesirable constraint, hampering the flexibility and sometimes the usability of an HCI system.
To achieve view invariance, some template-based approaches directly compare the input images against prestored image templates corresponding to different views. For example, a view-invariant gesture recognition method is presented in [13] based on key pose matching and Viterbi path searching. Although promising results have been reported in [13], such an exhaustive comparison strategy requires a compromise between view-angle resolution, the number of key poses, and computational efficiency. To address this issue, various forms of view-invariant pose features derived from visual hull data have been introduced [11], [17], [32]. Once extracted from the input data, these pose features can be directly matched to the training feature templates for gesture recognition. During the extraction of such features, the visual hull data are transformed to the 3D shape context [17], [32] or the 3D motion history volume [11], and the data points are indexed in a body-centered cylindrical coordinate system. The angular dimension in the cylindrical coordinate system is then suppressed to obtain pose features independent of the viewpoint. Pose features extracted using these methods are view invariant since the orientation of the subject no longer affects the extracted features. However, suppression of the angular dimension may cause information loss and introduce ambiguity in gesture recognition.
Pose features extracted from a pair of input silhouettes using multilinear analysis have been introduced for pose and gesture recognition [26], [33]. Such pose features are view invariant for the key poses used in tensor training. However, the view-invariance property does not generalize to the features of other, nonkey poses. To summarize, reliable extraction of view-invariant pose features for gesture spotting remains a challenge.
1.2.2 Online Gesture Spotting
Many gesture-driven HCI applications require real-time gesture spotting with minimum delay. Hence, it is desirable that the gesture spotting be done online using only the current and the past movement data. Accurate gesture spotting is also critical for successful HCI applications. An interactive HCI system provides feedback to the user in response to a gesture command. It is nearly impossible for the system to reverse any issued feedback without disturbing the user's experience. Therefore, reliable online gesture spotting is a pressing challenge for gesture-driven HCI.
A number of pattern analysis frameworks have been adopted for gesture spotting, including dynamic time warping (DTW) [34], HMMs [35], and conditional models such as the maximum entropy Markov model (MEMM) [36] and the conditional random field (CRF) [37]. DTW was designed to evaluate the similarity of two data segments, and it has been used in speech and music recognition [38], [39] as well as gesture recognition and spotting [9], [28], [40]. Compared to other models, DTW lacks an effective way to model system dynamics and gesture variations.
MEMM [36] and CRF [37] have recently been applied to pattern spotting with encouraging results [10]. MEMM and CRF are discriminative state-based models describing the conditional probability of the state sequence given the observation. MEMM and CRF have been claimed to be superior to generative models such as the HMM because they allow long-distance dependency [37], [41]. Features from past and future observations are used to explicitly represent long-distance interactions and dependency, leading to more natural models than those from HMM. However, MEMM and CRF have certain limitations. Apart from the label bias problem of MEMM [37], CRF training is much more computationally expensive and converges much more slowly than that of HMM and MEMM [37], [41]. The scalability of CRF is another problem. CRF builds a unified model including all of the patterns to be recognized. As a result, adding new patterns requires retraining the entire model, and the previous model has to be discarded.
The HMM [35] is a commonly used state-based, generative framework for sequential pattern analysis. Using an HMM, an observation sequence is modeled as being emitted from the corresponding hidden states. Gesture spotting has been done using the HMM network [27], [30], [31], [42], [43], [44], which is formed by connecting gesture and nongesture HMMs in parallel. Usually, one or two nongesture HMMs are used to provide likelihood thresholds for outlier rejection. For instance, in [30], [31], [42], [44], a weak universal movement model has been used for adaptive thresholding. Representing nongestures using one or two HMMs cannot effectively reject complex outliers, e.g., when they resemble portions of a gesture. Effective detection and modeling of nongesture movement patterns remains a challenge for HMM-based gesture spotting systems.
In summary, to develop online gesture spotting systems for gesture-driven HCI, two pressing challenges need to be addressed: 1) robust view-invariant pose feature extraction, and 2) effective detection and modeling of nongesture movement patterns for online gesture spotting. In this paper, we systematically tackle these challenges and make the following major contributions to online gesture spotting:
. A robust approach to view-invariant pose feature extraction from visual hull data using multilinear analysis. Experimental results show that the proposed pose features are view invariant for both training poses and new poses unseen in training.
. A systematic approach to detecting and modeling specific nongesture movement patterns and using their HMMs for outlier rejection. As shown in the experimental results, using specific nongesture models noticeably improves gesture spotting by reducing false alarm rates and increasing the recognition reliability, without significantly sacrificing the recognition rates.
Using the IXMAS data set [11], our proposed gesture spotting framework has been extensively tested, and our results are superior to those reported on the same data set in [11], [12], [13]. Fig. 1 shows the block diagram of our proposed system. This paper extends [45] by including the nongesture detection and modeling algorithm, and the results on view-invariance evaluation and gesture spotting.
2 THEORETICAL BACKGROUND
2.1 Multilinear Analysis
In multilinear analysis, multimode data ensembles are represented by tensors, higher order generalizations of vectors and matrices (two-mode tensors). A data ensemble affected by $m$ factors is represented as an $(m+1)$-mode tensor $\mathcal{T} \in \mathbb{R}^{N_v \times N_1 \times N_2 \times \cdots \times N_m}$, where $N_v$ is the length of a data vector and $N_i$, $i = 1, \ldots, m$, is the number of possible values of the $i$th factor.
As a generalization of the SVD of matrices, the high-order singular value decomposition (HOSVD) [46] can decompose a tensor $\mathcal{A} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_n}$ as follows:

$$\mathcal{A} = \mathcal{S} \times_1 U_1 \times_2 U_2 \cdots \times_n U_n, \qquad (1)$$

where $U_j \in \mathbb{R}^{N_j \times N'_j}$ ($N'_j \le N_j$) are the mode matrices containing orthonormal column vectors and $\mathcal{S} \in \mathbb{R}^{N'_1 \times N'_2 \times \cdots \times N'_n}$ is the core tensor. The mode matrices and the core tensor are, respectively, analogous to the left and right matrices and the diagonal matrix in SVD. Details on tensor algebra, such as the tensor multiplication and HOSVD, can be found in Appendix A in the supplemental material for this paper, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.199. Let $u_{j,k}$ be the $k$th row of $U_j$. It follows that [47]

$$\mathcal{A}_{i_1, i_2, \ldots, i_n} = \mathcal{S} \times_1 u_{1,i_1} \times_2 u_{2,i_2} \cdots \times_n u_{n,i_n}. \qquad (2)$$

Let $\mathcal{A}(i_1, \ldots, i_{j-1}, :, i_{j+1}, \ldots, i_n)$ be the column vector containing $\mathcal{A}_{i_1, \ldots, i_j, \ldots, i_n}$, $i_j = 1, \ldots, N_j$. Then,

$$\mathcal{A}(i_1, \ldots, i_{j-1}, :, i_{j+1}, \ldots, i_n) = \mathcal{S} \times_j U_j \times_1 u_{1,i_1} \times_2 u_{2,i_2} \cdots \times_{j-1} u_{j-1,i_{j-1}} \times_{j+1} u_{j+1,i_{j+1}} \cdots \times_n u_{n,i_n}. \qquad (3)$$
The coefficients $u_{k,i_k}$ in each mode can be considered as independent factors contributing to the data point, and the interaction of these factors is governed by the tensor $\mathcal{S} \times_j U_j$. Due to this factorizing property, multilinear analysis has been widely used to decompose data ensembles into perceptually independent sources of contributing factors for face recognition [48], 3D face modeling [49], synthesis of texture and reflectance [50], movement analysis and recognition from motion capture data [51], [52], and image synthesis for articulated movement tracking [53].
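As a concrete illustration of the decomposition in (1), the following NumPy sketch (not from the paper; the helper names `unfold`, `mode_multiply`, and `hosvd` are our own) computes a full HOSVD by taking the SVD of each mode unfolding and projecting the tensor onto the resulting mode bases:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Mode-n product T x_mode M, for M of shape (J, T.shape[mode])."""
    out_shape = [M.shape[0]] + [s for i, s in enumerate(T.shape) if i != mode]
    out = (M @ unfold(T, mode)).reshape(out_shape)
    return np.moveaxis(out, 0, mode)

def hosvd(T):
    """Full HOSVD (Eq. (1)): T = S x_1 U_1 x_2 U_2 ... x_n U_n."""
    Us = [np.linalg.svd(unfold(T, k), full_matrices=False)[0] for k in range(T.ndim)]
    S = T
    for k, U in enumerate(Us):
        S = mode_multiply(S, U.T, k)   # core tensor: project onto each mode basis
    return S, Us
```

Multiplying the core tensor back by the mode matrices reconstructs the original tensor exactly, mirroring (1).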
2.2 Reduced-Parameter Hidden Markov Models
Currently, the HMM [35] is the primary tool for sequential modeling and inference. When using traditional HMMs for gesture recognition, a large number of parameters usually need to be trained. For instance, an $n$-state HMM has $O(n^2)$ state transition parameters to be learned from training data. Learning a large set of model parameters presents an outstanding challenge to applications where only limited training data are available.

In [54], the reduced-parameter HMM has been proposed to address this challenge. The reduced HMM reduces the number of state transition parameters to $O(n)$. It improves the computational efficiency of the inference, allowing the number of states to increase while preserving real-time recognition. The reduced HMM is also trained using the expectation-maximization (EM) algorithm. Compared to traditional HMMs, due to the reduced parameter size, the computational complexity of reduced HMM training is much lower and fewer training samples are required [54].

Because of these advantages, we have adopted the reduced HMM in our proposed gesture spotting framework to represent gesture and nongesture movement patterns. On the other hand, due to the unique parameterization of the state transition probabilities, the reduced HMM can exactly represent only a subset of the standard left-to-right HMMs. In our research, this limitation does not create noticeable issues in spotting gestures from the IXMAS data set, indicating that the related movement patterns can be well modeled using the reduced HMM. See Appendix B in the supplemental material for this paper, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.199, for more details on the reduced HMM and [54] for a complete treatment.
3 KEY POSE SELECTION
In our proposed approach, key poses are selected from the gesture vocabulary for gesture spotting. Good key poses are quite different from each other to avoid ambiguities in pose features. In our approach, key pose candidates are first detected based on the motion energy, and then they are clustered to locate the key poses as the cluster centers. In our experiments, the visual hull data in the IXMAS data set were directly used.

Let $\mathcal{V}_t$ be the visual hull at time $t$ and $\mathcal{V}_t(i,j,k)$ be the value of the voxel at location $(i,j,k)$. Define the difference of visual hulls $\mathcal{F}_t$:
PENG AND QIAN: ONLINE GESTURE SPOTTING FROM VISUAL HULL DATA 1177
Fig. 1. The block diagram of the proposed gesture spotting framework.
$$\mathcal{F}_t(i,j,k) = \begin{cases} 1, & \mathcal{V}_t(i,j,k) = 0 \ \text{and} \ \sum_{\tau = t - W/2}^{t + W/2} \mathcal{V}_\tau(i,j,k) > 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (4)$$

where $W$ is the width of a time window. Given a movement sequence of $L$ frames, the "motion energy" at time $t$, $W + 1 \le t \le L - W$, is defined as the number of nonzero voxels in $\mathcal{F}_t$, i.e.,

$$E(t) = \sum_{i,j,k} \mathcal{F}_t(i,j,k). \qquad (5)$$
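The motion energy of (4) and (5) can be sketched as follows; this is our own illustrative NumPy implementation, assuming the visual hull sequence is stored as a boolean array with time as the first axis:

```python
import numpy as np

def motion_energy(V, W):
    """Motion energy E(t) from a sequence of binary visual hulls.

    V: (L, X, Y, Z) boolean array of voxel occupancies.
    W: even window width; here E(t) is computed for W/2 <= t < L - W/2.
    Returns {t: E(t)}, counting voxels that are empty at frame t but
    occupied somewhere within the centered window (Eqs. (4)-(5))."""
    L = V.shape[0]
    h = W // 2
    E = {}
    for t in range(h, L - h):
        window_any = V[t - h:t + h + 1].any(axis=0)  # occupied at some tau in window
        F = (~V[t]) & window_any                     # difference of visual hulls, Eq. (4)
        E[t] = int(F.sum())                          # nonzero-voxel count, Eq. (5)
    return E
```

Key pose candidates are then taken at the local extrema of the returned $E(t)$ curve.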
In our approach, key pose candidates are automatically detected at the local extrema of $E(t)$ of the training gesture data. These key pose candidates are further clustered to eliminate repetitive poses.

Pose clustering requires an interpose distance measure. To suppress distances caused by changes in body shapes and gesture execution locations, normalization such as centering and rescaling is applied to the visual hull data, and the normalized visual hulls are used in such distance computations. Details on the visual hull normalization can be found in Appendix C in the supplemental material for this paper, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2010.199. Define the distance between two normalized voxel data $\mathcal{V}$ and $\mathcal{V}'$:
$$d_V(\mathcal{V}, \mathcal{V}') = \frac{\|\mathcal{V} - \mathcal{V}'\|}{\|\mathcal{V} \cap \mathcal{V}'\|}, \qquad (6)$$
where $\|\cdot\|$ is the cardinality operator, which returns the number of valid (nonzero) voxels. The intersection ($\cap$) operation is carried out by treating the binary visual hulls as logical data arrays.

Assume that $\mathcal{V}_1$ and $\mathcal{V}_2$ are the normalized visual hulls of two poses $p_1$ and $p_2$, respectively. To minimize the impact of view angles, we define $d_P(p_1, p_2)$, a view-independent distance between the two poses:
$$d_P(p_1, p_2) = \min_{\theta} d_V(\mathcal{V}_1, R(\mathcal{V}_2, \theta)), \qquad (7)$$

where $R(\mathcal{V}_2, \theta)$ is the visual hull obtained by rotating $\mathcal{V}_2$ counterclockwise about its vertical body axis by angle $\theta$. In practice, given $\mathcal{V}_1$ and $\mathcal{V}_2$, $d_P(p_1, p_2)$ is found through an exhaustive search over $\theta$ on a uniform grid over $(0, 2\pi]$. Using $d_P(\cdot, \cdot)$, the distance matrix of the candidate key poses is computed. Then, normalized cut [55] is used to cluster the candidate key poses, and the resulting cluster centers are taken as the final key poses.
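A minimal sketch of the distances in (6) and (7), assuming that the numerator of (6) counts the differing voxels of the two binary hulls, and coarsening the rotation search to the four 90-degree rotations (the paper searches a fine uniform grid over $(0, 2\pi]$):

```python
import numpy as np

def voxel_distance(V1, V2):
    """d_V of Eq. (6): differing-voxel count over intersection count
    (assuming ||V - V'|| counts voxels where the binary hulls differ)."""
    inter = int(np.logical_and(V1, V2).sum())
    diff = int(np.logical_xor(V1, V2).sum())
    return diff / max(inter, 1)   # guard against an empty intersection

def pose_distance(V1, V2):
    """Coarse d_P of Eq. (7): minimize d_V over rotations of V2 about the
    vertical axis. Here only the four 90-degree rotations are searched,
    and axis 2 (the last axis) is assumed vertical."""
    return min(voxel_distance(V1, np.rot90(V2, k, axes=(0, 1))) for k in range(4))
```

In a full implementation the rotation grid would be refined (e.g., via resampling on a finer angular grid), matching the exhaustive search described in the text.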
4 POSE FEATURE EXTRACTION USING MULTILINEAR ANALYSIS
The visual hull of a human pose is mainly affected by three factors: the body shape of the subject, the joint angle configuration (pose), and the body orientation. In our research, we concentrate on the pose and orientation factors. Concerning the body shape factor, although the visual hull normalization can reduce its influence to a certain extent, as shown by the experimental results in this section, different body shapes do introduce relatively large variations to the pose features. This implies that to obtain accurate gesture spotting, the testing subject needs to share a similar body shape with at least one of the training subjects. In our future work, this limitation will be addressed by developing pose features invariant to both view and body shape. As shown in Fig. 2, the proposed pose feature is obtained by projecting an input visual hull onto a pose feature space using the core tensor obtained via pose tensor decomposition.
4.1 Pose Tensor Decomposition Using HOSVD
Given selected key poses, a pose tensor can be formed using their normalized voxel data in different orientations, as shown in Fig. 3a. Details on the pose tensor formulation in our experiments are presented in Section 6.1. This pose tensor can be decomposed using HOSVD to extract the core tensor and mode matrices, as described in Section 2.1. In our proposed framework, we do not conduct dimension reduction in any of the modes. According to (1), the pose tensor $\mathcal{A} \in \mathbb{R}^{N_v \times N_o \times N_p}$ can be decomposed into

$$\mathcal{A} = \mathcal{S} \times_1 U_v \times_2 U_o \times_3 U_p, \qquad (8)$$

where $\mathcal{S}$ is the core tensor of the same size as $\mathcal{A}$, and $U_v \in \mathbb{R}^{N_v \times N_v}$, $U_o \in \mathbb{R}^{N_o \times N_o}$, and $U_p \in \mathbb{R}^{N_p \times N_p}$ are, respectively, the voxel, orientation, and pose mode matrices. In our approach, we only calculate $U_o$ and $U_p$, and the voxel mode matrix $U_v$ is combined with $\mathcal{S}$. Thus,

$$\mathcal{A} = \mathcal{D} \times_2 U_o \times_3 U_p, \qquad (9)$$

where $\mathcal{D} = \mathcal{S} \times_1 U_v$. According to (3),

$$\mathcal{A}(:, i, j) = \mathcal{D} \times_2 u_{o,i} \times_3 u_{p,j}, \qquad (10)$$

where $\mathcal{A}(:, i, j)$ is the visual hull vector corresponding to pose $j$ in orientation $i$. The vectors $u_{o,i}$ and $u_{p,j}$ are, respectively, the $i$th row of $U_o$ (i.e., the coefficient vector of
Fig. 2. The diagram of the proposed pose feature extraction process.

Fig. 3. (a) The structure of the pose tensor. (b) Examples of reshaped columns of the core tensor $\mathcal{D}$.
orientation $i$) and the $j$th row of $U_p$ (i.e., the coefficient vector of pose $j$). Fig. 3b shows four sample columns of $\mathcal{D}$ remapped onto a cubic space.
Given a new visual hull $z$, its pose feature $v_p$ and orientation feature $v_o$ are found by solving the bilinear equation

$$z = \mathcal{D} \times_2 v_o \times_3 v_p. \qquad (11)$$

Intuitively, solving for $v_p$ and $v_o$ from $z$ can be illustrated as a rank-constrained basis projection process. Recall that the core tensor $\mathcal{D}$ is of size $N_v \times N_o \times N_p$. Intuitively, $\mathcal{D}$ can be considered an $N_o \times N_p$ array, with each array element being an $N_v \times 1$ vector. These array elements can be viewed as a set of basis vectors. Given the $N_v \times 1$ observation vector $z$, solving for $v_o$ and $v_p$ is equivalent to finding a linear projection of $z$ onto these basis vectors that optimizes the minimum-reconstruction-error criterion and also satisfies the rank constraint: the resulting projection coefficients must form a rank-1 $N_o \times N_p$ matrix given by $v_o v_p^T$. In other words, the projection coefficient for the basis vector at the $i$th row and $j$th column of $\mathcal{D}$ must equal $v_o(i)\,v_p(j)$, $\forall\, 1 \le i \le N_o,\ 1 \le j \le N_p$.
4.2 Pose Vector Extraction Using ALS
The alternating least-squares (ALS) [56] algorithm is often used to iteratively solve the bilinear equation (11). Let $v_o^{(n)}$ be the estimated orientation feature in the previous iteration. Then, $\mathcal{D}$ can be flattened into a matrix $C_o^{(n)} = \mathcal{D} \times_2 v_o^{(n)}$. Inserting $C_o^{(n)}$ into (11) leads to

$$z = C_o^{(n)} v_p. \qquad (12)$$

Thus, the current pose feature estimate $v_p^{(n+1)}$ can be found by solving the linear system (12). Similarly, using the current pose feature $v_p^{(n+1)}$, the orientation feature $v_o$ can be updated by solving a similar linear system:

$$z = C_p^{(n+1)} v_o, \qquad (13)$$

where $C_p^{(n+1)} = \mathcal{D} \times_3 v_p^{(n+1)}$. Given an initial value $v_p^{(0)}$ or $v_o^{(0)}$, $v_o$ and $v_p$ can be iteratively updated by alternately solving (12) and (13) until convergence.

Initialization is critical for ALS. In our research, we have adopted the following initialization strategy. First, all of the row vectors $\{u_{o,i}\}_{i=1}^{N_o}$ of $U_o$ are used as initial values for $v_o$. For each $u_{o,i}$, the corresponding pose vector $v_{p,i}$ is obtained by solving (12) only once. From $\{v_{p,i}\}_{i=1}^{N_o}$, the pose vector yielding the smallest distance to one of the standard poses $\{u_{p,j}\}_{j=1}^{N_p}$ is then chosen as $v_p^{(0)}$ to initialize ALS:

$$v_p^{(0)} = \arg\max_{v_{p,i}} \max_j \frac{v_{p,i} \cdot u_{p,j}}{\|v_{p,i}\| \, \|u_{p,j}\|}, \qquad (14)$$

where $i = 1, \ldots, N_o$ is the orientation index and $j = 1, \ldots, N_p$ is the pose index.
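The alternating updates (12) and (13) can be sketched as follows. This illustrative NumPy version (our own, not the paper's code) uses a uniform initial orientation vector rather than the initialization of (14); note that $(v_o, v_p)$ is recoverable only up to a reciprocal scale factor:

```python
import numpy as np

def als_bilinear(D, z, n_iter=100):
    """Alternating least squares for the bilinear equation (11),
    z = D x_2 v_o x_3 v_p, with D of shape (Nv, No, Np).

    Starts from a uniform orientation vector; the paper instead
    initializes from the rows of the orientation mode matrix via Eq. (14)."""
    Nv, No, Np = D.shape
    v_o = np.ones(No) / No
    v_p = np.zeros(Np)
    for _ in range(n_iter):
        C_o = np.einsum('vop,o->vp', D, v_o)          # flatten D along orientation
        v_p = np.linalg.lstsq(C_o, z, rcond=None)[0]  # solve (12) for v_p
        C_p = np.einsum('vop,p->vo', D, v_p)          # flatten D along pose
        v_o = np.linalg.lstsq(C_p, z, rcond=None)[0]  # solve (13) for v_o
    return v_o, v_p
```

Each least-squares step can only decrease the reconstruction residual, so the iteration converges to a (possibly local) minimum; the initialization of (14) is what steers it toward the intended solution.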
4.3 Evaluation of View Invariance
It is important to systematically evaluate the view invariance of the proposed pose features. In our research, we have performed a series of evaluation studies using data from the IXMAS gesture recognition data set [11]. The IXMAS data set contains calibrated multiview silhouette and visual hull data from 12 subjects performing 14 daily actions. For each subject, three movement trials were included in the data set, each containing various executions of all 14 actions. To be comparable, in our experiments, we have used data from the same 10 subjects and 11 actions (Table 10) as those in [11] and [12]. The pose tensor used in these studies was formed using data from one of the 10 subjects. Details on forming this pose tensor are given in Section 6.1.
In the first study, pose features were extracted from the normalized voxel data corresponding to two poses in different orientations. Figs. 4a and 4b, respectively, show the normalized voxel data corresponding to a key pose and a nonkey pose in 16 orientations. These testing data were selected from a subject different from the one used for pose tensor construction. Figs. 4c and 4f show the corresponding pose features obtained using multilinear analysis. It can be seen that these pose features are invariant to body orientation for both the key pose and the nonkey pose.
In the second study, we examined the robustness of the proposed pose features in the presence of visual hull errors. In this study, protrusion errors and partial occlusion errors were added to the testing visual hull data used in the previous study (Figs. 4a and 4b). Protrusion errors correspond to large uncarved blocks of background voxels missed during the visual hull extraction. To add a protrusion error to a visual hull, we first select a protrusion sphere with a random center in the background and a radius of three voxels (a 10th of the side length of the normalized visual hull). To realistically synthesize a protrusion error, a valid protrusion sphere is required to overlap with the visual hull in a volume less than half of the sphere. Otherwise, another random sphere is selected until a valid protrusion sphere is found. Once such a sphere is found, all of the voxels inside the sphere are considered protrusion voxels, with their values set to 1.

Fig. 4. Voxel data of (a) a key pose and (b) a nonkey pose in 16 body orientations and their corresponding pose vectors in (c) and (f), respectively. Subplots (d) and (g) show the pose features extracted from the voxel data corrupted by protrusion errors corresponding to the key pose data in (a) and nonkey pose data in (b), respectively. Subplots (e) and (h) show the pose features extracted from the noisy data, with partial occlusions corresponding to the key pose and nonkey pose data, respectively.

Partial occlusion errors correspond to large blocks of foreground voxels wrongly carved during the visual hull extraction. To add a partial occlusion error to a visual hull, we first select an occlusion sphere with a random center in the foreground (on the subject) and a radius of three voxels. Then, all the voxels inside this occlusion sphere are treated as occluded, with their values set to 0. Fig. 5 shows examples of an original visual hull (Fig. 5a) and its two noisy versions corrupted by protrusion (Fig. 5b) and partial occlusion (Fig. 5c) errors. Pose features extracted from the noisy data are shown in Figs. 4d and 4g (extracted from the noisy data with protrusion errors) and Figs. 4e and 4h (from the noisy data with partial occlusions). It can be seen from these figures that the pose features extracted from the noisy data still largely resemble those from the original data. Hence, the view-invariance property still holds, in general, when the visual hull data are corrupted by protrusion and partial occlusion errors.
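The sphere-based error synthesis described above can be sketched as follows (our own illustrative implementation; the radius and the half-volume acceptance test follow the description in the text):

```python
import numpy as np

def add_sphere_error(V, radius=3, protrusion=True, rng=None, max_tries=1000):
    """Corrupt a binary visual hull with a random spherical error (Sec. 4.3).

    protrusion=True picks a sphere centered on a random background voxel
    and sets its voxels to 1, accepting only spheres that overlap the hull
    in less than half their volume; protrusion=False picks a sphere
    centered on a foreground voxel and carves its voxels out (occlusion)."""
    rng = np.random.default_rng() if rng is None else rng
    V = V.copy()
    coords = np.indices(V.shape)
    centers = np.argwhere(V == (0 if protrusion else 1))
    for _ in range(max_tries):
        c = centers[rng.integers(len(centers))]
        sphere = sum((coords[d] - c[d]) ** 2 for d in range(3)) <= radius ** 2
        if not protrusion:
            V[sphere] = 0                              # carve foreground voxels
            return V
        if np.logical_and(sphere, V).sum() < 0.5 * sphere.sum():
            V[sphere] = 1                              # valid protrusion: add voxels
            return V
    raise RuntimeError("no valid protrusion sphere found")
```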
In the third study, we further evaluated the view invariance of the proposed pose features in a quantitative manner. In this study, we examined the interorientation similarity of the pose features corresponding to the same pose in different orientations. For each subject, we randomly selected 100 frames of visual hull data from one of the three movement trials of the subject in the IXMAS data set. A total of 1,000 frames from 10 subjects were selected. Among these frames, 318 frames have small (< 0.6) distances (as defined in (7)) to some of the key poses. These frames are referred to as the key pose frames and the rest as the nonkey pose frames. Each visual hull frame was rotated to 16 facing directions, and the corresponding pose features as well as their pair-wise similarities were obtained. The minimum value of these similarities, defined as the minimum interorientation similarity (MIOS), is used to quantify the degree of view invariance of the pose features for this visual hull frame. The 10-bin MIOS histograms for the key pose and the nonkey pose frames are shown in Figs. 6a and 6b, respectively. As context, Fig. 6e shows the histogram of the pair-wise interframe similarities between the 1,000 testing visual hull frames. From Fig. 6, it can be seen that the MIOS values are above 0.9 for all the key pose frames and the majority (619 out of 682) of the nonkey pose frames. Only a small percentage (6.3 percent) of the testing frames have low MIOS values. Hence, we experimentally verified the view-invariance property of the proposed pose features. The low MIOS values (i.e., discrepancy in pose features of the same pose in different orientations) are mainly due to the existence of multiple solutions when using ALS to find the pose features. In our gesture spotting experiments using the IXMAS data set, this issue did not present a significant problem.
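The MIOS computation can be sketched as follows, given the pose features of one pose already extracted in each of the 16 orientations (cosine similarity is assumed as the similarity measure, consistent with the normalized inner product used in (14) but not stated explicitly for this study):

```python
import numpy as np

def mios(features):
    """Minimum interorientation similarity (MIOS) of Section 4.3.

    features: (K, d) array, one pose feature per orientation of the same
    pose (K = 16 in the paper). Returns the minimum pairwise cosine
    similarity; values near 1 indicate view-invariant features."""
    F = features / np.linalg.norm(features, axis=1, keepdims=True)
    S = F @ F.T                        # pairwise cosine similarities
    iu = np.triu_indices(len(F), k=1)  # off-diagonal pairs only
    return S[iu].min()
```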
The same study has been repeated using noisy data. To each voxel data frame used in the study, protrusion errors were added with a probability of 1/3 and partial occlusion errors with a probability of 1/3, and with a probability of 1/3 the data were left unchanged. Using such noisy data, the MIOS histograms of the key pose and the nonkey pose frames were obtained, as shown in Figs. 6c and 6d, respectively. It is clear that the voxel errors only slightly affected the view invariance of the pose features.
To examine the impact of different body shapes, we have compared pose features extracted from the same pose across different people. Fig. 7a shows the 10 visual hulls of a pose performed by the 10 IXMAS subjects. These testing data share similar views so that the impact of body shape on the pose feature can be studied independently from that of view. The corresponding pose features are given in Fig. 7b. It can be seen that, although generally similar to each other, these pose features across different subjects are less consistent than those over different views shown in Fig. 4.
We have further examined the interpeople similarity of the proposed pose features. In our research, 25 key poses (Fig. 11) were selected from the IXMAS data set for gesture spotting. In this study, for each key pose, its voxel data from the 10 IXMAS subjects were first aligned in the same body orientation. After extracting the pose features from the aligned voxel data, the average pair-wise interpeople similarities of these pose features for the given key pose were calculated. This process was repeated for all 25 key poses. Once the 25 average interpeople similarities were computed, one for each key pose, their histogram was obtained as shown in Fig. 7c. This histogram has 10 bins with a total count of 25, the number of data points. As shown in Fig. 7c, five bins sit to the right of 0.5, with a total count of 16. This implies that out of the 25 key poses, 16 have an average interpeople similarity greater than or equal to 0.5. It is clear that changes in body shape do introduce relatively large variations. For gesture spotting, this implies that the body shape of the testing subject needs to be similar to that of one of the training subjects, which is certainly a limitation of the proposed framework and is to be addressed in our future work.
1180 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 33, NO. 6, JUNE 2011
Fig. 5. Examples of visual hull errors. (a) The original voxel data. (b) The noisy data corrupted by a random protrusion error (red sphere). (c) The noisy data with a random partial occlusion error (green sphere).
Fig. 6. Similarity histograms of pose features. (a)-(d) Histograms of interorientation similarities for key pose data, nonkey pose data, noisy key pose data, and noisy nonkey pose data, respectively. (e) Similarity histogram of the pose features from 1,000 visual hull frames.
Fig. 7. (a) Voxel data of a pose performed by 10 subjects. (b) The corresponding pose vectors. (c) The histogram of the average interpeople similarities.
4.4 Analysis of Orientation Coefficient Vectors
When extracting the pose feature from a new observation, the corresponding orientation feature v_o is also available from the ALS solution. The orientation features of different poses in the same body orientation are close to each other. For example, Fig. 8b shows similar orientation features of the 25 key poses roughly aligned in the same body orientation (Fig. 8a). Using these orientation features, the body orientation angle can be estimated through manifold learning as shown in [33].
5 GESTURE SPOTTING USING HMM
Using the proposed pose features as observations, gestures can be spotted from continuous data by using an HMM network [27], [30]. As illustrated in Fig. 9, an HMM network is formed by a number of parallel branches connecting the nonemitting starting and end states through movement HMMs, including gesture and nongesture models. The same gesture models (GM) are used for gesture spotting from continuous data and gesture recognition from presegmented data. The nongesture models contain a general garbage model (GGM) and additional HMMs representing specific nongesture movements.
5.1 Model Learning
In the proposed framework, the reduced HMM introduced in Section 2.2 has been used to create gesture models. Each emitting state is modeled as a Gaussian mixture with a diagonal covariance matrix. Assume that there are N gestures in the gesture vocabulary G. For each gesture g ∈ G, the corresponding HMM with model parameter set λ_g is learned using the EM algorithm from the associated training samples manually segmented from training data. Once these gesture models are learned, they can be used to classify presegmented movement data. Let O = {O_1, O_2, ..., O_t} be a sequence of pose features obtained via multilinear analysis from a gesture movement segment. This movement segment is then classified as the gesture g* yielding the maximum likelihood, i.e.,
g* = argmax_{g ∈ G} p(O | λ_g).   (15)
The number of states in these HMMs is determined by cross validation and linear search. The same parameters are also used for gesture spotting.
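The maximum-likelihood classification of (15) can be sketched with the standard forward algorithm in log space. This is a simplified illustration, not the paper's implementation: each state emits from a single diagonal-covariance Gaussian rather than a mixture, and all model parameters below are hypothetical stand-ins.

```python
import numpy as np

def log_gauss_diag(o, mean, var):
    """Log density of a diagonal-covariance Gaussian at observation o."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mean) ** 2 / var)

def forward_loglik(obs, log_A, means, vars_, log_pi):
    """log p(O | lambda) via the forward algorithm in log space.

    obs: (T, d) pose-feature sequence; log_A: (K, K) log transition matrix;
    means, vars_: (K, d) per-state Gaussian parameters; log_pi: (K,) log priors.
    """
    K = log_A.shape[0]
    log_alpha = log_pi + np.array([log_gauss_diag(obs[0], means[k], vars_[k]) for k in range(K)])
    for o in obs[1:]:
        log_b = np.array([log_gauss_diag(o, means[k], vars_[k]) for k in range(K)])
        # logsumexp over predecessor states for each current state
        log_alpha = log_b + np.logaddexp.reduce(log_alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(log_alpha)

def classify(obs, models):
    """Eq. (15): pick the gesture whose HMM assigns the highest likelihood."""
    return max(models, key=lambda g: forward_loglik(obs, *models[g]))
```

A presegmented sequence `obs` would be classified as `classify(obs, models)`, with `models` mapping each gesture label to its learned parameter set.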
5.1.1 Nongesture Models
Effectively rejecting nongesture movements is a key challenge for gesture spotting. In [27] and [30], one or two general garbage gesture models are used to model all the nongesture movement patterns. In our proposed framework, a garbage gesture model is also used to model general nongesture movement sequences. It has a single emitting state, with a flat probability distribution function over the entire observation space. The single state can loop back to itself (nongesture continues) or exit (nongesture ends). Applying this garbage model is equivalent to setting a threshold on the normalized log-likelihood of the spotted gestures, where this threshold is simply given by the logarithm of the flat probability of the emitting state of the garbage model.
As discussed earlier, using one or two general garbage models is not effective for outlier rejection in gesture spotting. To tackle this challenge, in addition to a general garbage gesture model, we have also deployed a number of specific nongesture models, including automatically identified and manually specified nongesture models. The goal is to represent specific nongesture movement patterns in the training data, and then use them to reject similar outlier patterns in gesture spotting. Some nongesture patterns are manually picked, e.g.,
. repetitive intergesture patterns (such as stand still),
. false gestures that are similar to the true ones, and
. movement patterns shared by two or more gestures.
Such patterns are common in the training data, and their HMMs are trained in the same way as the gesture HMMs.
In addition, we have developed an approach (Fig. 10a) to automatically detecting and modeling nongesture movement patterns. First, the training movement sequences are segmented into element pieces by finding the minima of the motion energy defined in Section 3. Element pieces that overlap with training segments of the gestures and manually specified nongestures (if there are any) are eliminated, leaving only unused element pieces corresponding to the remaining nongesture movement patterns. Then, the similarity matrix of these element pieces is found using DTW [34] based on the Euclidean distance. According to this similarity matrix, the nongesture training data are grouped into a number of clusters using normalized cut [55], one cluster for each automatically detected nongesture model. The number of clusters is preset manually. Finally, the element pieces close to the cluster centers are taken as the training samples to train the corresponding nongesture HMM. In practice, the number of nongesture models can be flexibly tuned according to the specific application.
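The DTW-based similarity matrix at the heart of this clustering step can be sketched as follows. This is a minimal textbook DTW with Euclidean frame distances, not the exact variant of [34]; converting distances to an affinity matrix (e.g., exp(-M/sigma)) before normalized-cut clustering is an assumption, and the function names are ours.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW alignment cost between two feature sequences of shapes (T1, d)
    and (T2, d), using Euclidean distances between frames."""
    T1, T2 = len(a), len(b)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[T1, T2]

def pairwise_dtw(pieces):
    """Symmetric DTW distance matrix over a list of element pieces."""
    n = len(pieces)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            M[i, j] = M[j, i] = dtw_distance(pieces[i], pieces[j])
    return M
```

Given `M`, an affinity such as `np.exp(-M / M.mean())` could then be fed to a normalized-cut (spectral) clustering routine to form one cluster per automatically detected nongesture model.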
5.2 Gesture Spotting
PENG AND QIAN: ONLINE GESTURE SPOTTING FROM VISUAL HULL DATA 1181
Fig. 8. Voxel data of 25 poses performed in (a) the same orientation and (b) their orientation vectors.
Fig. 9. The HMM network used in gesture spotting.
Using both gesture and nongesture HMMs, gesture spotting can be achieved by evaluating the joint probability of the observation sequence and the path of state transitions in an HMM network. During gesture spotting, at each time t, one frame of pose feature o_t is input to the HMM network. Let q_t be the hidden state at time t, S_n be the nth hidden state in the HMM network, S be the collection of all the states of the HMM network, and O_t be the observation sequence from the beginning of the movement piece up to time t. Let
δ_t(S_n) = max_{q_1, ..., q_{t-1}} p(O_t, q_1, ..., q_{t-1}, q_t = S_n | λ),   (16)
be the joint probability of O_t and the optimal state path to the current state q_t = S_n. In (16), λ is the parameter set of the entire HMM network. Using the Viterbi algorithm, δ_t(S_n) can be computed from δ_{t-1}(S), ∀S ∈ S, and the current observation in an incremental manner.
Using the reduced HMM, each gesture model contains a nonemitting end state. Reaching this end state indicates the execution of the corresponding gesture. Let E_h be the end state of HMM h. Let G be the gesture vocabulary and F be the nongesture set. At time instant t, if the end probability of a gesture g* is the largest among all the gestures and nongestures, g* is spotted, i.e.,
g* = argmax_{h ∈ G ∪ F} δ_t(E_h) and g* ∈ G.   (17)
Once a gesture is detected, its starting time can be easily backtracked along the most probable path.
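One online update of (16) together with the end-state test of (17) can be sketched in log space over a flattened network state space. This is a simplified illustration under stated assumptions: the network's nonemitting states are not modeled explicitly, and the mapping `end_states` from each model label to its exit-state indices, like all parameters, is hypothetical.

```python
import numpy as np

def spot_step(log_delta, log_A, log_b, end_states):
    """One incremental Viterbi update of delta_t(S_n) plus the end test.

    log_delta: (K,) scores delta_{t-1}(S) for all network states;
    log_A: (K, K) log transition matrix over the whole network;
    log_b: (K,) log emission density of the current pose feature per state;
    end_states: dict mapping each gesture/nongesture label to the indices
    of states that feed its nonemitting end state.
    """
    scores = log_delta[:, None] + log_A        # joint score via each predecessor
    prev = scores.argmax(axis=0)               # backpointers for start-time recovery
    new_delta = log_b + scores.max(axis=0)     # Eq. (16), log domain
    # End probability of each model = best score among its exit states (Eq. 17).
    ends = {h: new_delta[idx].max() for h, idx in end_states.items()}
    best = max(ends, key=ends.get)
    return new_delta, prev, best
```

Running `spot_step` once per frame and storing `prev` allows the start frame of a spotted gesture to be backtracked along the most probable path, as described above.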
This preliminary spotting result is further refined. A length constraint and a likelihood constraint are set up to reject outliers. In our experiments, we require the length of a spotted gesture to be shorter than 50 frames (since the length of ground-truth gesture segments is in the range of 10 to 35 frames), and the likelihood of the spotted gesture segment to be larger than 10^-80 (the majority of the training gesture likelihoods are in the range of 10^-20 to 10^-50). Movement segments satisfying both constraints are admitted as gesture segment candidates. Then, temporal consistency is further used to stabilize the gesture spotting results. To be specific, a final spotting decision is made only when gesture segment candidates sharing the same starting frame are continually detected T (a prechosen threshold) times without other candidates detected in between. This gesture spotting scheme is summarized in Fig. 10b. The spotting result is a series of spotted gesture segments marked by their beginning and end frame numbers together with their gesture labels and likelihood values.
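The refinement stage described above (length constraint, likelihood constraint, and temporal-consistency count T) can be sketched as a small stateful filter. The thresholds mirror the values quoted in the text; the function and its candidate tuple format are our own illustrative choices, not the paper's interface.

```python
import math

def make_refiner(max_len=50, min_loglik=math.log(1e-80), T=3):
    """Stateful filter over per-frame spotting candidates.

    A candidate (label, start, end, loglik) must satisfy the length
    constraint (< max_len frames) and the likelihood constraint
    (loglik > log 1e-80); a final decision is emitted only after T
    consecutive detections sharing the same start frame.
    """
    state = {"start": None, "count": 0}

    def refine(candidate):
        label, start, end, loglik = candidate
        if end - start + 1 >= max_len or loglik <= min_loglik:
            state["start"], state["count"] = None, 0   # constraint violated: reset
            return None
        if start == state["start"]:
            state["count"] += 1                        # same start frame seen again
        else:
            state["start"], state["count"] = start, 1  # new candidate run
        return (label, start, end) if state["count"] >= T else None

    return refine
```

Calling the returned `refine` once per frame with the current best candidate yields `None` until the same start frame has been seen T times in a row, at which point the final spotting decision is emitted.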
6 EXPERIMENTS AND ANALYSIS
Our proposed gesture spotting framework has been tested using the IXMAS gesture data set [11] and our results are superior to those reported in [11], [12], [13] on the same data set. To be comparable, in our experiments, we only used data from the same 10 subjects and 11 actions (Table 10) as those in [11] and [12].
6.1 Key Pose Selection and Pose Tensor Formation
Using the proposed key pose selection method, 25 key poses were selected from a movement trial of Florian (one of the 10 subjects); they are shown in Fig. 11 in their most distinguishable views. The visual hull data of these key frames from Florian's data were then used to form the pose tensor. They were first normalized to the size of 30 × 30 × 30, and then manually aligned to approximately share the same facing direction. To obtain the training pose tensor, each aligned visual hull was further rotated about its vertical body axis to generate voxel data facing 16 directions, evenly distributed from 0 to 15π/8 with π/8 between adjacent views. Their mean was also subtracted to center the voxel data. The resulting centered voxel data from all key poses were then arranged into the three-mode 30^3 × 16 × 25 training pose tensor. This pose tensor was further decomposed to obtain the core tensor using HOSVD. Given the core tensor, the pose feature of an input visual hull frame can be found using ALS.
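The HOSVD decomposition of a three-mode tensor can be sketched via mode-n unfoldings and per-mode SVDs. This is a generic HOSVD [46] sketch, not the paper's code: a small random array stands in for the 30^3 × 16 × 25 voxel pose tensor, and all names are ours.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_core(T):
    """HOSVD: mode matrices from the SVD of each unfolding, and the core
    tensor Z = T x1 U1^T x2 U2^T x3 U3^T."""
    Us = [np.linalg.svd(unfold(T, m), full_matrices=False)[0] for m in range(3)]
    Z = T.copy()
    for m, U in enumerate(Us):
        # Mode-m product with U^T: contract U^T's columns against axis m.
        Z = np.moveaxis(np.tensordot(U.T, Z, axes=([1], [m])), 0, m)
    return Z, Us

# Stand-in for the voxels x orientations x key-poses tensor
# (the real first mode would have 27,000 voxel entries).
T = np.random.default_rng(1).normal(size=(60, 16, 25))
Z, (U_voxel, U_orient, U_pose) = hosvd_core(T)
```

Multiplying `Z` back by the mode matrices reconstructs `T` exactly here, since the mode matrices are full rank; in the paper, the core tensor plays the role of the fixed basis against which ALS extracts the pose and orientation features of a new visual hull.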
6.2 Training and Testing Schemes
Following [11] and [12], we have evaluated our proposed approach through cross validation. In each training and testing cycle, the movement data and associated pose features of nine of the 10 subjects in the IXMAS data set were used as the training data and those of the remaining subject were then used for testing. This procedure was repeated 10 times so that each subject was used once as the testing subject. The final results reported in Sections 6.3 and 6.4 were based on the cumulative results obtained in all 10 training-testing cycles.
6.2.1 Movement Signature and Ground-Truth Data
Following [11] and [12], in our research, an elemental movement was selected for each action as the corresponding representative signature. For example, for the "check watch" action, the "raise hand" motion was selected as the movement signature. The 11 action signatures form the gesture vocabulary in our experiments. All of the movement segments corresponding to the action signatures were manually identified from the IXMAS data set and used as training and ground-truth data. They are referred to as the gesture segments in this section.
Fig. 10. (a) Automatic detection and HMM training of nongesture movement patterns. (b) The flowchart of the gesture spotting algorithm.
Fig. 11. The 25 key poses selected from the IXMAS data set.
6.2.2 Training
In each training-testing cycle, on average about 283 gesture segments were used to train the 11 gesture HMMs. To be consistent with [11], for gestures executed multiple times in a training movement trial, only one of them was (randomly) selected to be included in the training set.
For gesture spotting, nongesture HMMs also need to be trained. The training data of the manually selected nongestures were hand-picked, and those of the automatically detected nongestures were obtained during nongesture movement detection as discussed in Section 5.1.1.
6.2.3 Testing
In each training-testing cycle, once the 11 gesture HMMs were learned, the proposed framework was first tested using presegmented data. The testing data were from the gesture segments of the testing subject. To be consistent with [11], for gestures executed multiple times in a single movement trial, only one of them was used in testing. On average, about 31 testing movement segments were used in a testing cycle. In each testing cycle for gesture spotting, all three complete movement trials of the testing subject were used as testing data.
6.3 Gesture Recognition Using Presegmented Data
To evaluate gesture recognition using presegmented testing data, the recognition and the false alarm rates for both the individual gestures and the entire gesture vocabulary were obtained according to Table 1 and are given in Table 2. The corresponding confusion matrix is shown in Table 3. Table 2 also includes the gesture recognition results on the same data set reported in [11] and [12]. It can be seen that our results using the proposed pose features are slightly better than the existing results.
A valid question to ask is whether it is necessary to extract pose features for gesture recognition. Since the 25 key poses have been identified, a possible brute-force gesture recognition approach is to first match an input frame to one of the key poses to obtain a discrete-observation model, and then HMMs can be used for gesture recognition. In our research, we have implemented this simple brute-force discrete-observation method and have compared it with our method using pose features. In this discrete method, an input frame is assigned the ID of its closest key pose neighbor according to the interpose distance measure (7), using 16 search angles on a uniform grid over (0, 2π]. Then, the reduced HMMs using the pose ID as the discrete observations are used for gesture modeling and recognition. The results are in Table 2. It is clear that such a simple discrete method performs much worse than the proposed method. Hence, it is valid to extract the proposed view-invariant pose features for reliable gesture recognition.
6.4 Gesture Spotting Using Continuous Data
6.4.1 Evaluation Criteria
To evaluate the proposed gesture spotting framework, we have analyzed and compared our spotting results against the ground-truth data in a number of aspects, including the temporal matching accuracy, the recognition and false alarm rates, and the reliability of recognition.
Let F_b(i) and F_e(i) be the beginning and end frame numbers of the ith true gesture segment in the testing data.
TABLE 1. Notations for Evaluation Using Presegmented Data
TABLE 2. Gesture Recognition Results Using Presegmented Data
TABLE 3. Confusion Matrix (in Percent)
The length of this segment is L_GT(i) = F_e(i) − F_b(i) + 1. Let S_b(i) and S_e(i) be the beginning and end frame numbers of a spotted gesture segment. Define the absolute temporal matching score O_A(i) as the number of overlapped frames between the spotted and ground-truth gesture segments:
O_A(i) = min{S_e(i), F_e(i)} − max{S_b(i), F_b(i)},   (18)
and the relative temporal matching score O_R(i) as the ratio of O_A(i) to the length of the true segment:
O_R(i) = O_A(i) / L_GT(i).   (19)
When O_R(i) is larger than a prechosen threshold τ (0 < τ ≤ 1), the spotted gesture is considered temporally matched to the ground-truth segment. In our experiments, the default value of τ was 0.5.
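The temporal matching test of (18)-(19) is straightforward to implement. The sketch below follows the equations as printed (note that (18) as printed has no +1 term); the threshold is written `tau`, and the function name is ours.

```python
def temporal_match(sb, se, fb, fe, tau=0.5):
    """Overlap score between a spotted segment [sb, se] and a ground-truth
    segment [fb, fe], per Eqs. (18)-(19), plus the match test against tau."""
    o_a = min(se, fe) - max(sb, fb)   # absolute score O_A, Eq. (18)
    l_gt = fe - fb + 1                # ground-truth segment length L_GT
    o_r = o_a / l_gt                  # relative score O_R, Eq. (19)
    return o_a, o_r, o_r > tau
```

For non-overlapping segments `o_a` is negative, so the match test correctly fails without a special case.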
Once a spotted gesture segment is temporally matched to a ground-truth segment, their gesture labels are compared. If they share the same gesture label, correct spotting occurs. Otherwise, a substitution error occurs. If a spotted gesture segment is not temporally matched to any of the ground-truth segments with respect to the prechosen τ, an insertion error occurs. On the other hand, if a ground-truth gesture segment is not matched to any spotted gesture segment, a missing error occurs. Relevant notations are given in Table 4. When a true gesture segment is not correctly spotted, there are two possibilities: either not detected (missing) or detected but misrecognized (substitution). Therefore, the number of unspotted gesture segments E_T = N_T − N_C = E_S + E_M. Table 5 lists the indicators used in our analysis to measure the spotting and temporal matching accuracies.
6.4.2 Experimental Results
To examine the impact of using specific nongesture models on gesture spotting, we have experimented with various HMM combinations. We first started with the gesture models and the general garbage model. Then, the automatically detected nongesture models (ANGM) and the manually selected nongesture models (MNGM) were gradually added.
The gesture spotting accuracy and temporal matching accuracy (when τ = 0.5) using various models are given in Tables 6 and 7, respectively. From Table 6, it can be seen that using more nongesture models greatly reduced the insertion errors without significantly diminishing correct recognition. Consequently, the reliability of the spotted gestures greatly increased. It can also be seen from Table 6 that when more ANGMs were used, adding MNGMs only slightly improved the spotting accuracy. From Table 7, we can see that different gesture model combinations had only a very slight impact on the temporal matching accuracy, and the resulting temporal matching accuracy measures are all at reasonable levels.
To examine the influence of the temporal matching threshold τ on gesture spotting, results using different values of τ have been obtained as shown in Table 8. The corresponding HMM combinations are MNGM+15ANGM+GM+GGM and 15ANGM+GM+GGM. It is clear from Table 8 that when τ decreased, both the recognition rate and reliability increased, and meanwhile the insertion errors and the false alarm rate reduced. This is because when τ is low, more spotted gesture segments are matched to true gesture segments. Moreover, Table 8 also indicates that τ affected the distribution of the substitution and missing errors. Recall that E_T = E_S + E_M. It can be seen from Table 8 that when τ was decreasing, both E_T and E_M were decreasing while E_S was increasing. This is because reducing τ allows more spotted gestures to be temporally matched to true gesture segments (thus reducing E_M). Meanwhile, not all of the newly admitted segments have the same gesture labels as the true gesture segments, thus leading to increased E_S.
To examine how the proposed method can spot gestures from multiple testing subjects, we have run tests using IXMAS subjects 6 to 10 for testing and the rest for training. Using all the nongesture models with τ = 0.5, the resulting RR is 74.54 percent and FAR 10.38 percent, which are comparable to those obtained using one testing subject (RR: 80.14 percent, FAR: 10.16 percent, row 6 of Table 9).
6.4.3 Comparison with Existing Methods
Gesture spotting results using continuous streams from the IXMAS data set have been reported in [11] and [13]. Table 9 shows the comparison between our results and those in [11]. It can be seen that our method using different τ values consistently achieved higher recognition rates and lower false positive rates than those in [11]. Moreover, the way we evaluate the gesture spotting accuracy is much stricter and more complete than that used in [11].
TABLE 4. Notations for Gesture Spotting Evaluation
TABLE 5. Performance Indicators for Gesture Spotting (Summations Are over the Correctly Spotted Gesture Segments)
Differently from our method, where gestures were spotted directly from continuous movement data, in [11] Weinland et al. first segmented the testing movement trial data using motion energy, and then classified the resulting movement segments as either gestures or nongestures. To compute the recognition and false alarm rates, ground truth was obtained manually on
top of the segmented data. Consequently, the spotting results in [11] do not take into account the segmentation errors. For example, if the segmentation algorithm wrongly grouped two gestures, the combined segment would be treated as a nongesture movement segment in the ground truth, and this segmentation error would not be reflected in the spotting results. In practice, segmentation errors also lead to gesture spotting errors. Obviously, it is suboptimal in gesture spotting evaluation to obtain ground truth purely based on the segmented data and omit segmentation errors in recognition rate computation. The true gesture recognition rate (also used in our research) should be the number of correctly spotted gestures (N_C) divided by the number of true gestures (N_T). On the other hand, when the errors introduced by wrong segmentation are not considered, the resulting recognition rate is then N_C divided by N_S, the number of correctly segmented gestures. Since N_S is always less than or equal to N_T, the recognition rate without counting the segmentation errors will always be higher than or at most equal to the actual recognition rate. Even using a stricter method for recognition rate computation, it can be seen in Table 9 that our proposed method consistently outperformed the method in [11].
TABLE 8. Gesture Recognition Accuracy with Various τ Values and Two HMM Network Models: 15ANGM+GM+GGM/MNGM+15ANGM+GM+GGM
TABLE 9Comparison of Gesture Recognition Accuracy
TABLE 7. Temporal Matching Accuracy of Correctly Recognized Gestures (τ = 0.5)
TABLE 6. Gesture Recognition Accuracy (τ = 0.5)
To demonstrate the advantage of using the reduced HMM in gesture spotting, we have obtained results using a left-to-right chain HMM without state skips. As shown in Table 9, the results obtained using the reduced model are significantly better than those of the chain model, especially in terms of the false alarm rates: the recognition rates obtained using the reduced model are slightly better than those from the chain model, while the false alarm rates from the reduced model are only half of those from the chain model. Therefore, using the reduced HMM does improve gesture spotting compared to the traditional left-to-right HMM.
A monocular, template-based gesture spotting system is introduced in [13], where the percentage of correctly labeled frames has been used to measure the gesture spotting accuracy. In [13], the global optimal path of the HMM states obtained at the end of the data stream has been used to derive the frame labels. We have also obtained such a global optimal path from our HMM implementation and computed the frame-wise recognition rates as shown in Table 10. It can be seen that our overall result is better than that in [13]. Table 10 also includes frame-wise online spotting results from our method. It is clear that using the global optimal path increases frame-wise spotting accuracy. The gesture set in [13] includes all 14 gestures in the IXMAS data set, plus a new "stand still" gesture. To be consistent with [11] and [12], our experiments were done using 11 gestures. The comparison with [13] was also based on these gestures.
In addition to improved results, our proposed system has other advantages over that in [13]. A major weakness of the method in [13] is that the matching templates for gesture spotting depend on the specific camera tilt angle. In [13], an input image is matched to synthetic image templates pregenerated for the given tilt angle. Such image templates must be generated again when a new tilt angle is adopted. This constraint makes the system setup more time and resource demanding. Although it is possible to store the matching templates for a set of tilt angles, it is unclear how well the method in [13] performs when the actual camera tilt angle differs from its closest neighbor among the prestored angles. In contrast, the core tensor in our method does not depend on the camera setup and can be used to extract the pose feature from any voxel data. In addition, our approach requires less training data and equipment than that in [13], which needs motion capture data and animation software to synthesize the matching templates. Motion capture becomes necessary when the required motion data are not available in a public database. In contrast, other than the visual hull data, our method does not require any motion capture data/system or animation software.
6.5 Computational Complexity
The proposed gesture spotting approach is implemented in Matlab. On average, it takes 2.4 seconds to process one frame of visual hull data. Pose feature extraction using ALS is the most time-consuming step. An optimized C++ implementation could greatly reduce the running time. In fact, we have implemented a near-real-time, image-based gesture spotting system [26], running at up to 15 fps on a standard PC (3.6 GHz dual-core Intel Xeon CPU, 3.25 GB RAM, Windows XP Professional).
7 CONCLUSIONS AND FUTURE WORK
In this paper, we present a framework for gesture spotting from visual hull data. View-invariant features are extracted using multilinear analysis and used as input to HMM-based gesture spotting. As shown by the experimental results, the proposed pose features exhibit satisfying view-invariance properties, and using specific nongesture models improves gesture spotting in terms of lower false alarm rates and higher recognition reliability, without sacrificing much of the recognition rate.
In our future work, we will improve the proposed pose feature to achieve body-shape invariance. We will also perform research on obtaining more complete solutions to pose feature extraction via multipoint initialized ALS.
ACKNOWLEDGMENTS
This work was supported in part by US National Science Foundation grants RI-04-03428 and DGE-05-04647. The authors are thankful to the referees for their insightful comments and to Stjepan Rajko for developing and releasing the reduced HMM code to the public and for proofreading an early version of the paper.
REFERENCES
[1] C. Cruz-Neira, D.J. Sandin, T.A. DeFanti, R.V. Kenyon, and J.C. Hart, "The Cave: Audio Visual Experience Automatic Virtual Environment," Comm. ACM, vol. 35, no. 6, pp. 64-72, 1992.
[2] T. Starner, B. Leibe, D. Minnen, T. Westyn, A. Hurst, and J. Weeks, "The Perceptive Workbench: Computer-Vision-Based Gesture Tracking, Object Tracking, and 3D Reconstruction of Augmented Desks," Machine Vision and Applications, vol. 14, pp. 59-71, 2003.
[3] C. Keskin, K. Balci, O. Aran, B. Sankur, and L. Akarun, "A Multimodal 3D Healthcare Communication System," Proc. 3DTV Conf., pp. 1-4, 2007.
[4] A. Camurri, B. Mazzarino, G. Volpe, P. Morasso, F. Priano, and C. Re, "Application of Multimedia Techniques in the Physical Rehabilitation of Parkinson's Patients," J. Visualization and Computer Animation, vol. 14, pp. 269-278, 2003.
[5] H.S. Park, D.J. Jung, and H.J. Kim, "Vision-Based Game Interface Using Human Gesture," Advances in Image and Video Technology, pp. 662-671, Springer, 2006.
[6] S.-W. Lee, "Automatic Gesture Recognition for Intelligent Human-Robot Interaction," Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 645-650, 2006.
TABLE 10. Comparison of Per-Frame Accuracy of Gesture Spotting
[7] A. Camurri, S. Hashimoto, M. Ricchetti, A. Ricci, K. Suzuki, R. Trocca, and G. Volpe, "EyesWeb: Toward Gesture and Affect Recognition in Interactive Dance and Music Systems," Computer Music J., vol. 24, no. 1, pp. 57-69, 2000.
[8] G. Qian, F. Guo, T. Ingalls, L. Olson, J. James, and T. Rikakis, "A Gesture-Driven Multimodal Interactive Dance System," Proc. IEEE Int'l Conf. Multimedia and Expo, pp. 1579-1582, 2004.
[9] Y. Zhu and G. Xu, "A Real-Time Approach to the Spotting, Representation, and Recognition of Hand Gestures for Human-Computer Interaction," Computer Vision and Image Understanding, vol. 85, pp. 189-208, 2002.
[10] H.-D. Yang, S. Sclaroff, and S.-W. Lee, "Sign Language Spotting with a Threshold Model Based on Conditional Random Fields," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 7, pp. 1264-1277, July 2009.
[11] D. Weinland, R. Ronfard, and E. Boyer, "Free Viewpoint Action Recognition Using Motion History Volumes," Computer Vision and Image Understanding, vol. 104, nos. 2/3, pp. 249-257, 2006.
[12] D. Weinland, E. Boyer, and R. Ronfard, "Action Recognition from Arbitrary Views Using 3D Exemplars," Proc. IEEE Int'l Conf. Computer Vision, pp. 1-7, 2007.
[13] F. Lv and R. Nevatia, "Single View Human Action Recognition Using Key Pose Matching and Viterbi Path Searching," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[14] H. Francke, J.R. del Solar, and R. Verschae, "Real-Time Hand Gesture Detection and Recognition Using Boosted Classifiers and Active Learning," Advances in Image and Video Technology, pp. 533-547, Springer, 2007.
[15] G. Ye, J.J. Corso, D. Burschka, and G.D. Hager, "VICs: A Modular HCI Framework Using Spatiotemporal Dynamics," Machine Vision and Applications, vol. 16, no. 1, pp. 13-20, 2004.
[16] G. Ye, J.J. Corso, and G.D. Hager, "Gesture Recognition Using 3D Appearance and Motion Features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 160-166, 2004.
[17] M. Holte and T. Moeslund, "View Invariant Gesture Recognition Using 3D Motion Primitives," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 797-800, 2008.
[18] T. Kirishima, K. Sato, and K. Chihara, "Real-Time Gesture Recognition by Learning and Selective Control of Visual Interest Points," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 3, pp. 351-364, Mar. 2005.
[19] S. Mitra and T. Acharya, "Gesture Recognition: A Survey," IEEE Trans. Systems, Man, and Cybernetics, Part C: Applications and Rev., vol. 37, no. 3, pp. 311-324, May 2007.
[20] A. Bobick and Y. Ivanov, "Action Recognition Using Probabilistic Parsing," Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, pp. 196-202, 1998.
[21] A. Yilmaz, "Recognizing Human Actions in Videos Acquired by Uncalibrated Moving Cameras," Proc. IEEE Int'l Conf. Computer Vision, pp. 150-157, 2005.
[22] Y. Shen and H. Foroosh, "View-Invariant Action Recognition from Point Triplets," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1898-1905, Oct. 2009.
[23] V. Parameswaran and R. Chellappa, "View Invariance for Human Action Recognition," Int'l J. Computer Vision, vol. 66, no. 1, pp. 83-101, 2006.
[24] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, "Actions as Space-Time Shapes," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 12, pp. 2247-2253, Dec. 2007.
[25] A.F. Bobick and J.W. Davis, "The Recognition of Human Movement Using Temporal Templates," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 3, pp. 257-267, Mar. 2001.
[26] B. Peng, G. Qian, and S. Rajko, "View-Invariant Full-Body Gesture Recognition from Video," Proc. Int'l Conf. Pattern Recognition, pp. 1-5, 2008.
[27] S. Eickeler, A. Kosmala, and G. Rigoll, "Hidden Markov Model Based Continuous Online Gesture Recognition," Proc. Int'l Conf. Pattern Recognition, pp. 1206-1208, 1998.
[28] J. Alon, V. Athitsos, Q. Yuan, and S. Sclaroff, "A Unified Framework for Gesture Recognition and Spatiotemporal Gesture Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1685-1699, Sept. 2009.
[29] T. Starner, J. Weaver, and A. Pentland, "Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1371-1375, Dec. 1998.
[30] H.-K. Lee and J. Kim, "An HMM-Based Threshold Model Approach for Gesture Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, no. 10, pp. 961-973, Oct. 1999.
[31] H.-D. Yang, A.-Y. Park, and S.-W. Lee, "Gesture Spotting and Recognition for Human-Robot Interaction," IEEE Trans. Robotics, vol. 23, no. 2, pp. 256-270, Apr. 2007.
[32] C. Chu and I. Cohen, "Pose and Gesture Recognition Using 3D Body Shapes Decomposition," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 69-78, 2005.
[33] B. Peng and G. Qian, "Binocular Full-Body Pose Recognition and Orientation Inference Using Multilinear Analysis," Tensors in Image Processing and Computer Vision, S. Aja-Fernandez, R. de Luis García, D. Tao, and X. Li, eds., Springer, 2009.
[34] H. Sakoe, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-26, no. 1, pp. 43-49, Feb. 1978.
[35] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989.
[36] A. McCallum, D. Freitag, and F. Pereira, "Maximum Entropy Markov Models for Information Extraction and Segmentation," Proc. Int'l Conf. Machine Learning, pp. 591-598, 2000.
[37] J.D. Lafferty, A. McCallum, and F.C.N. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data," Proc. Int'l Conf. Machine Learning, pp. 282-289, 2001.
[38] C. Myers, L. Rabiner, and A. Rosenberg, "Performance Tradeoffs in Dynamic Time Warping Algorithms for Isolated Word Recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 28, no. 6, pp. 623-635, Dec. 1980.
[39] A. Pikrakis, S. Theodoridis, and D. Kamarotos, "Recognition of Isolated Musical Patterns Using Context Dependent Dynamic Time Warping," IEEE Trans. Speech and Audio Processing, vol. 11, no. 3, pp. 175-183, May 2003.
[40] J. Lichtenauer, E. Hendriks, and M. Reinders, "Sign Language Recognition by Combining Statistical DTW and Independent Classification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 2040-2046, Nov. 2008.
[41] T.G. Dietterich, "Machine Learning for Sequential Data: A Review," Proc. Joint IAPR Int'l Workshop Structural, Syntactic, and Statistical Pattern Recognition, pp. 15-30, 2002.
[42] H.-D. Yang, A.-Y. Park, and S.-W. Lee, "Robust Spotting of Key Gestures from Whole Body Motion Sequence," Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pp. 231-236, 2006.
[43] S. Rajko, G. Qian, T. Ingalls, and J. James, "Real-Time Gesture Recognition with Minimal Training Requirements and Online Learning," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-8, 2007.
[44] K. Nickel and R. Stiefelhagen, "Visual Recognition of Pointing Gestures for Human-Robot Interaction," Image and Vision Computing, vol. 25, no. 12, pp. 1875-1884, 2007.
[45] B. Peng, G. Qian, and S. Rajko, "View-Invariant Full-Body Gesture Recognition via Multilinear Analysis of Voxel Data," Proc. Int'l Conf. Distributed Smart Cameras, 2009.
[46] L.D. Lathauwer, B.D. Moor, and J. Vandewalle, "A Multilinear Singular Value Decomposition," SIAM J. Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253-1278, 2000.
[47] L. Elden, Matrix Methods in Data Mining and Pattern Recognition.SIAM, 2007.
[48] M.A.O. Vasilescu and D. Terzopoulos, “Multilinear Analysis ofImage Ensembles: Tensorfaces,” Proc. European Conf. ComputerVision, pp. 447-460, 2002.
[49] D. Vlasic, M. Brand, H. Pfister, and J. Popovi, “Face Transfer withMultilinear Models,” Proc. ACM SIGGRAPH, pp. 426-433, 2005.
[50] M.A.O. Vasilescu and D. Terzopoulos, “Tensortextures: Multi-linear Image-Based Rendering,” ACM Trans. Graphics, vol. 23,no. 3, pp. 334-340, 2004.
[51] M.A.O. Vasilescu, “Human Motion Signatures: Analysis, Synth-esis, Recognition,” Proc. Int’l Conf. Pattern Recognition, pp. 456-460,2002.
[52] J. Davis and H. Gao, “An Expressive Three-Mode PrincipalComponents Model of Human Action Style,” Image and VisionComputing, vol. 21, no. 11, pp. 1001-1016, 2003.
[53] C.-S. Lee and A. Elgammal, “Modeling View and PostureManifolds for Tracking,” Proc. IEEE Int’l Conf. Computer Vision,pp. 1-8, 2007.
[54] S. Rajko and G. Qian, “HMM Parameter Reduction for Practical Gesture Recognition,” Proc. IEEE Int’l Conf. Face and Gesture Recognition, pp. 1-6, 2008.
[55] J. Shi, S. Belongie, T. Leung, and J. Malik, “Image and Video Segmentation: The Normalized Cut Framework,” Proc. IEEE Int’l Conf. Image Processing, pp. 943-947, 1998.
[56] H.A.L. Kiers, “An Alternating Least Squares Algorithm for PARAFAC2 and Three-Way DEDICOM,” Computational Statistics and Data Analysis, vol. 16, no. 1, pp. 103-118, 1993.
Bo Peng received the BS degree in electrical engineering from Zhejiang University, Hangzhou, China, in 2006. He is currently working toward the PhD degree in electrical engineering at Arizona State University, Tempe. He was a member of Chu Kochen Honors College, Zhejiang University, from 2002 to 2006. His research interests include human motion analysis, computer vision, and machine learning.
Gang Qian received the BE (Distinction) degree from the University of Science and Technology of China (USTC), Hefei, China, in 1995. He received the MS and PhD degrees in electrical engineering from the University of Maryland, College Park, in 1999 and 2002, respectively. He is an assistant professor in the School of Arts, Media and Engineering, and the School of Electrical, Computer and Energy Engineering at Arizona State University, Tempe. He was a faculty research assistant (2001-2002) and a research associate (2002-2003) at the Center for Automation Research (CfAR) at the University of Maryland Institute for Advanced Computer Studies. He has served on the organizing/technical committees of a number of international conferences, including the 2008 and 2009 International Conference on Image Processing, and the 2006 International Conference on Image and Video Retrieval. His current research includes computer vision and pattern analysis, sensor fusion and information integration, multimodal sensing and analysis of human movement and activities, human-computer interaction and human-centered interactive systems, and machine learning for computer vision. He is a member of the IEEE and the IEEE Computer Society.
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 33, NO. 6, JUNE 2011