
136 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 46, NO. 1, JANUARY 2016

A Gesture Recognition System for Detecting Behavioral Patterns of ADHD

Miguel Ángel Bautista, Antonio Hernández-Vela, Sergio Escalera, Laura Igual, Oriol Pujol, Josep Moya, Verónica Violant, and María T. Anguera

Abstract—We present an application of gesture recognition using an extension of dynamic time warping (DTW) to recognize behavioral patterns of attention deficit hyperactivity disorder (ADHD). We propose an extension of DTW using one-class classifiers in order to encode the variability of a gesture category and thus perform an alignment between a gesture sample and a gesture class. We model the set of gesture samples of a certain gesture category using either Gaussian mixture models or an approximation of convex hulls. Thus, we add a theoretical contribution to the classical warping path in DTW by including local modeling of intraclass gesture variability. This methodology is applied in a clinical context, detecting a group of ADHD behavioral patterns defined by experts in psychology/psychiatry, to provide support to clinicians in the diagnosis procedure. The proposed methodology is tested on a novel multimodal dataset (RGB plus depth) of recordings of children with ADHD exhibiting behavioral patterns. We obtain satisfying results when compared to standard state-of-the-art approaches in the DTW context.

Index Terms—Attention deficit hyperactivity disorder (ADHD), convex hulls, dynamic time warping (DTW), Gaussian mixture models (GMMs), gesture recognition, multimodal RGB-depth data.

I. INTRODUCTION

NOWADAYS, human gesture recognition is one of the most challenging tasks in computer vision. Due to the large number of potential applications involving human gesture recognition in fields like surveillance [10], sign language recognition [28], or clinical assistance [21], among others, there is a large and active research community devoted to dealing with

Manuscript received March 22, 2013; revised December 23, 2014 and January 20, 2015; accepted January 22, 2015. Date of publication February 24, 2015; date of current version December 14, 2015. This work was supported in part by the Ministerio de Sanidad 2011 Instituto de Mayores y Servicios Sociales Ref. MEDIMINDER and RERECAIXA 2011 Ref. REMEDI, and in part by TIN2013-43478-P. The work of M. Á. Bautista was supported in part by the Secretaria Universitats i Recerca - Departament d’Economia i Coneixement of the Generalitat de Catalunya, and in part by Fons Social Europeu. The work of A. Hernández-Vela was supported by the Formación Profesor Universitario Fellowship from the Ministerio de Educación of Spain. This paper was recommended by Associate Editor S. Zafeiriou.

M. Á. Bautista, A. Hernández-Vela, S. Escalera, L. Igual, and O. Pujol are with the Department of Applied Mathematics and Analysis, Universitat de Barcelona, Barcelona 08007, Spain, and also with the Computer Vision Center, Campus UAB, Barcelona 08193, Spain (email: [email protected]).

J. Moya is with the Parc Taulí Foundation, Sabadell 08208, Spain.

V. Violant is with the Department of Didactics and Educational Organization, University of Barcelona, Barcelona 08035, Spain.

M. T. Anguera is with the Department of Behavioral Sciences Methodologies, University of Barcelona, Barcelona 08035, Spain.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCYB.2015.2396635

this problem. Current methodologies have shown preliminary results on very simple scenarios, but they are still far from human performance.

In the gesture recognition field there exists a wide number of methods based on dynamic programming algorithms for both alignment and clustering of temporal series [30]. Probabilistic methods such as hidden Markov models or conditional random fields are also very common [28]. Nevertheless, one of the most popular methods for human gesture recognition is dynamic time warping (DTW) [3], [23]. It offers a simple yet effective temporal alignment between sequences of different lengths. However, the application of such methods to gesture detection in complex scenarios becomes a hard task due to the high variability of the environmental conditions among different domains. Some common problems are: a wide range of human pose configurations, influence of background, continuity of human movements, spontaneity of human actions, speed, appearance of unexpected objects, illumination changes, partial occlusions, or different points of view, just to mention a few. These effects can cause dramatic changes in the description of a certain gesture, generating a great intraclass variability. Since usual DTW is applied between a sequence and a single pattern, it fails to take such variability into account. Some methods to tackle this problem have recently appeared in [9], [17], and [29].

In addition, the release of the Microsoft Kinect sensor in late 2010 has allowed easy and inexpensive access to synchronized depth imaging with standard video data, combining both sources into what is commonly named RGB-D images (RGB plus depth). This data fusion, very welcome in the computer vision community, has reduced the burden of the first steps in many pipelines devoted to image or object segmentation and opened new questions such as how this data can be effectively described and fused. This depth information has been particularly exploited for human body segmentation and tracking. Shotton et al. [25] introduced one of the greatest advances in the extraction of the human body pose using RGB-D, which is provided as part of the Kinect human recognition framework. The method is based on inferring pixel label probabilities through Random Forest from learned offsets of depth features. Girshick et al. [8] later proposed a different approach in which they directly regress the positions of the body joints, without the need for an intermediate pixel-wise body limb classification as in [25]. The extraction of body pose information opens the door to developing more accurate gesture recognition methodologies [7], [19].

2168-2267 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


In particular, there is a growing interest in the application of gesture recognition methods in the clinical context. Concretely, gesture recognition methods can be even more valuable in psychological or psychiatric scenarios where the diagnosis of a certain disorder is based on the interpretation of certain behavioral patterns of the subject. To date, video sequences have been analyzed in a frame-by-frame fashion by experts who are typically trained for several months to achieve a good performance on the analysis. Of course, this situation is not applicable to large amounts of data, since it is a very time-consuming procedure, and its automation is highly desirable. Specifically, the case of attention deficit hyperactivity disorder (ADHD) is one of the most notable scenarios, since it is the most commonly studied and diagnosed psychiatric disorder in childhood, globally affecting about five percent of children [16]. In this line of research, some works can be found in [11] and [27], which develop tools to assist children with autism-related disorders. Nevertheless, one of the main problems that clinicians experience when diagnosing ADHD is the huge subjective component of the interpretation of symptoms, because their definition is either ambiguous or inaccurate. In this sense, an objective gesture recognition tool able to detect behavioral patterns defined by a set of psychiatric/psychological experts would be of great value in helping clinicians with ADHD diagnosis. This paper presents a study on a concrete set of ADHD patterns, which we aim to extend in future works.

We propose to use an extension of the DTW method that is able to perform an alignment between a sequence and a set of N pattern samples from the same gesture category. The variance caused by environmental factors is modeled using either a Gaussian mixture model (GMM) [26] or an approximation of a convex hull [5]. Consequently, the distance metric used in the DTW framework is redefined in order to provide a probability-based measure. The proposed method is evaluated on a novel ADHD behavioral pattern dataset, in which both subjects diagnosed with ADHD and a control group were recorded in a classroom environment, obtaining satisfying results. Our list of contributions is as follows.

1) An extension of classical DTW by modeling the intraclass variability of gestures is proposed.

2) GMMs and approximated convex hulls are embedded in the DTW by defining novel distances.

3) A novel multimodal ADHD behavioral pattern dataset is presented.

4) We test our proposal on the novel ADHD behavioral pattern dataset, obtaining very satisfying results.

The rest of this paper is structured as follows. Section II defines the ADHD behavioral patterns and describes the feature extraction. Section III presents the gesture recognition proposal. Section IV presents a novel ADHD dataset and shows the experimental results on it. Finally, Section V summarizes the conclusions.

II. DEFINITION OF ADHD BEHAVIORAL PATTERNS AND FEATURE EXTRACTION

We split the methodology of the proposal into different stages. First, we define the ADHD behavioral patterns to be learned. Second, the considered set of multimodal features for each frame is described. Finally, the novel DTW extension based on GMM and convex hull modeling is presented.

A. Definition of ADHD Behavioral Patterns

ADHD is one of the most common childhood disorders and can continue through adolescence and adulthood. Symptoms include difficulty staying focused and paying attention, difficulty controlling behavior, and hyperactivity. ADHD has three subtypes, defined by the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) and CIE X [14], [18].

1) Predominantly hyperactive-impulsive.
2) Predominantly inattentive.
3) Combined hyperactive-impulsive and inattentive.

In addition, children who have symptoms of inattention may:
1) be easily distracted, miss details, forget things, and frequently switch from one activity to another;
2) have difficulty focusing on one task;
3) become bored with a task after only a few minutes, unless they are doing something enjoyable;
4) have difficulty focusing attention on organizing and completing a task or learning something new.

Children who have symptoms of hyperactivity may:
1) fidget and squirm in their seats;
2) dash around, touching or playing with anything and everything in sight;
3) have trouble sitting still during dinner, school, and story time;
4) be constantly in motion.

Children who have symptoms of impulsiveness may:
1) have difficulty waiting for things they want or waiting their turns in games;
2) often interrupt conversations or other activities.

In order to develop a system that automatically detects

ADHD behavioral patterns, we first have to define a set of ADHD behavioral patterns (gestures to detect) that are both objective and descriptive yet discriminable. In other words, the set of patterns has to be descriptive enough to provide an ADHD profile of the subject, and simple enough to allow automating the detection. Furthermore, the symptoms depicted in the DSM [18] were strictly followed, and the set of behavioral patterns was designed in conjunction with a team of experts in ADHD.

In order to define the behavioral patterns to be automatically detected, an analysis of the context in which the video sequences take place has to be performed. Taking into account that the video sequences were recorded in a school class context, including mathematical exercises and computer gaming, with no disturbing events taking place, the set of defined ADHD behavioral patterns is the following (an example is shown in Fig. 1).

1) Head Turning Behavioral Pattern: The definition of this behavioral pattern stems from the different symptoms in the inattention branch. Behaviors like being easily distracted, missing details, forgetting things, frequently switching from one activity to another, or having difficulty focusing on one thing have a close relationship with turning the head from the goal task to other


Fig. 1. (a) Example of the head turning behavioral pattern. (b) Torso in table pattern example; notice how the torso of the subject is completely laid on the table. (c) Sample of a classmate invasion in which the left subject invades the right subject's space. (d) Movement behavioral pattern sample.

unrelated tasks. Therefore, this indicator is defined as a head turn to either the right or left side.

2) Torso in Table Behavioral Pattern: The torso in table behavioral pattern is related to hyperactive symptoms such as fidgeting and squirming in their seats and having trouble sitting still during dinner, school, and story time.

3) Classmate’s Desk Invasion Behavioral Pattern: This behavioral pattern takes its root from impulsive symptoms like often interrupting conversations or others’ activities, or having difficulty waiting for things they want or waiting their turns in games.

4) Movement With/Without a Pattern Behavioral Pattern: The last pattern aims to provide a detection for those symptoms across all ADHD branches (inattentiveness, hyperactivity, and impulsiveness) that involve a high quantity of motion.

This set of behavioral patterns is representative enough of the different symptoms of ADHD and provides a generalization analysis of the feasibility of our approach for supporting diagnosis.

B. Image Acquisition, Preprocessing, and Feature Extraction

We use the Kinect sensor to capture video sequences in which subjects diagnosed with ADHD and subjects not diagnosed with ADHD (control group) were recorded. In this sense, we use the depth information provided by the Kinect sensor to obtain a segmentation of the subjects in the scene, obtaining a complete segmentation of their upper-body limbs. Given a frame $I_t$, $t = 1, \dots, T$, the corresponding segmentation $S_t$ on the depth map is computed by Otsu's method [20], keeping the biggest convex unconnected components, that is, the components with a larger number of pixels. In this sense, if three subjects appear on the scene, the three

Fig. 2. Color descriptor for the “head turning” behavioral pattern. The image in the first column shows a subject turning the head, while the image in the last column shows a frontal face. Bounding boxes are overlaid in green. Images in the central column show the respective color naming descriptors. They are composed of 4 × 4 cells, each of them containing a color name label.

biggest components are kept. Otherwise, if two subjects appear on the scene, the two biggest convex unconnected components are kept as the segmentation. Moreover, Random Forest segmentation is applied over the foreground objects [25] in order to segment the regions corresponding to different subjects.
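As an illustration of this preprocessing step, the sketch below implements Otsu thresholding and the selection of the k largest unconnected components in pure NumPy. The function names (`otsu_threshold`, `k_largest_components`) and the 4-connectivity choice are our own assumptions, not details given in the paper.

```python
import numpy as np
from collections import deque

def otsu_threshold(img):
    """Otsu's method: pick the gray level maximizing between-class variance."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    total = hist.sum()
    sum_all = np.dot(np.arange(256), hist)
    best_t, best_var, w0, sum0 = 0, -1.0, 0.0, 0.0
    for t in range(256):
        w0 += hist[t]
        if w0 == 0 or w0 == total:
            continue
        sum0 += t * hist[t]
        w1 = total - w0
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def k_largest_components(mask, k):
    """Label 4-connected foreground components and keep only the k largest."""
    h, w = mask.shape
    labels = -np.ones((h, w), dtype=int)
    sizes = []
    for i in range(h):
        for j in range(w):
            if mask[i, j] and labels[i, j] < 0:
                lab, size = len(sizes), 0
                q = deque([(i, j)])
                labels[i, j] = lab
                while q:  # breadth-first flood fill
                    y, x = q.popleft()
                    size += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and labels[ny, nx] < 0:
                            labels[ny, nx] = lab
                            q.append((ny, nx))
                sizes.append(size)
    keep = list(np.argsort(sizes)[::-1][:k])
    return np.isin(labels, keep)
```

In practice one would threshold the depth map with `otsu_threshold` and pass the resulting foreground mask to `k_largest_components` with k equal to the number of subjects in the scene.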

1) Head Turning Behavioral Pattern Feature: The features for the head rotation detection are computed for each frame t as follows. First of all, we obtain the bounding box $B_t$ containing the head by means of GrabCut segmentation [24]. As GrabCut is a semi-automatic method, a manual bounding box has to be provided by the user at the first frame. With the resulting segmentation mask, the bounding box for that frame can be easily computed. Additionally, some morphological operations are applied on the segmentation mask in order to initialize the segmentation of the following frame, as in [12]. Once a bounding box $B_t$ is detected for one frame, a color-based descriptor $F^{\text{Head}}_t$ is extracted from the pixels inside it. The bounding box is first divided into O × O cells, and each of them is described with a label $\gamma \in \{1, \dots, G\}$ corresponding to the most frequent color as follows:

$$F^{\text{Head}}_{t_{i,j}} = \underset{l \in \gamma}{\arg\max} \left( \sum_{x \in B_{t_{i,j}}} \delta(\text{ColorName}(x) - l) \right), \quad \forall i \in 1, \dots, O, \; \forall j \in 1, \dots, O \tag{1}$$

where $B_{t_{i,j}}$ is the (i, j)th cell of the head bounding box at time t. In addition, ColorName(x) is a function which returns the color name of an RGB pixel x, and δ(·) is the Dirac delta function. The color-naming data with G = 11 basic colors (red, orange, brown, yellow, green, blue, purple, pink, white, gray, black) presented in [22] has been used. An example of the feature computation procedure is shown in Fig. 2.
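A minimal sketch of this descriptor is given below. The real system uses the learned color-naming model of [22] with G = 11 names; here `color_name` is a hypothetical stand-in that maps a pixel to the nearest of a few RGB prototypes, just to make Eq. (1) concrete.

```python
import numpy as np

# Hypothetical stand-in for the color-naming model of [22]:
# map each RGB pixel to the nearest of a few prototype colors.
PROTOTYPES = {0: (255, 0, 0), 1: (0, 255, 0), 2: (0, 0, 255),
              3: (255, 255, 255), 4: (0, 0, 0)}

def color_name(pixel):
    """Return the label of the nearest prototype color (Euclidean in RGB)."""
    px = np.asarray(pixel, dtype=float)
    return min(PROTOTYPES, key=lambda l: np.sum((px - PROTOTYPES[l]) ** 2))

def head_descriptor(box, O):
    """Eq. (1) sketch: label each of the O x O cells of the head bounding
    box with its most frequent color name."""
    h, w, _ = box.shape
    desc = np.zeros((O, O), dtype=int)
    for i in range(O):
        for j in range(O):
            cell = box[i * h // O:(i + 1) * h // O,
                       j * w // O:(j + 1) * w // O].reshape(-1, 3)
            labels = [color_name(p) for p in cell]
            desc[i, j] = np.bincount(labels).argmax()  # arg max of delta counts
    return desc
```

With O = 4 this yields the 4 × 4 grid of color-name labels shown in Fig. 2.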

2) Torso on Desk Behavioral Pattern Feature: The torso on desk feature computes the relative distance of the subject’s torso to the desk, in order to provide a measure of how close the torso is in relation to the desk. In this sense, this distance is computed as the Euclidean distance of the top pixel of the


Fig. 3. Distances computed using the segmentation of the depth image. The green contour indicates the boundary of the segmentation mask. The blue dashed line shows the table limit. Vertical and horizontal orange arrows show the distances computed for the “torso in table” and “classmate’s desk invasion” behavioral patterns, respectively.

segmentation mask of the subject to the closest desk pixel. This distance can be easily computed by finding the uppermost pixel $x^{\text{top}} = \{x_i \mid (x_i, y_i) \in S_t, (x_j, y_j) \in S_t, y_i \le y_j, \forall i \ne j\}$ in the segmentation mask $S_t$ of the subject, and its corresponding lowermost pixel in the vertical direction $x^{\text{bot}} = \{x_i \mid (x_i, y_i) \in S_t, (x_j, y_j) \in S_t, y_i \ge y_j, \forall i \ne j\}$:

$$F^{\text{Torso}}_t = \left\| x^{\text{top}} - x^{\text{bot}} \right\|_2. \tag{2}$$

An example of the feature calculation is shown in Fig. 3.

3) Classmate’s Desk Invasion Feature: In order to compute the classmate’s desk invasion feature, we also use the segmentation mask $S_t$. For a given subject, the feature is basically defined as the minimum distance between the pixels in the subject’s unconnected components of the mask $S_t$ and the pixels in the neighbor classmate’s components $S_t^{ne}$ ($ne = 1, 2$ in our case):

$$F^{\text{Inv}}_t = \min_{ne \in N, \; x_n \in S_t^{ne}, \; x_s \in S_t} \| x_s - x_n \|_2. \tag{3}$$
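Under the assumption that each subject's mask is given as a boolean image, Eq. (3) can be sketched as a brute-force minimum over pairwise pixel distances (the helper name `invasion_feature` is ours):

```python
import numpy as np

def invasion_feature(subject_mask, neighbor_masks):
    """Eq. (3) sketch: minimum Euclidean distance between any pixel of the
    subject's segmentation mask and any pixel of a neighbor's mask."""
    xs = np.argwhere(subject_mask).astype(float)  # (row, col) coordinates
    best = np.inf
    for mask in neighbor_masks:
        xn = np.argwhere(mask).astype(float)
        if len(xs) == 0 or len(xn) == 0:
            continue
        # all pairwise distances between the two pixel sets
        d = np.sqrt(((xs[:, None, :] - xn[None, :, :]) ** 2).sum(-1))
        best = min(best, d.min())
    return best
```

For production use one would restrict the masks to component boundaries first, since only boundary pixels can attain the minimum.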

An example of this computation is shown in Fig. 3.

4) Movement With/Without Pattern Feature: With the movement with/without a pattern feature we want to describe general movements of the subject. Subjects diagnosed with hyperactivity often perform two different types of motion patterns: random movements, often denoted as agitation, and movements that follow a pattern, as in the case of nervous tics, in which a group of muscles moves in a repetitive fashion. This feature is designed to cope with both cases. In this sense, we first compute the optical flow [15] between the current and next frames. Then, we compute the average optical flow magnitude over the pixels belonging to the segmentation mask of the subject:

$$F^{\text{Mov}}_t = \frac{1}{|S_t|} \sum_{x \in S_t} \sqrt{u_x^2 + v_x^2} \tag{4}$$

where $u_x$ and $v_x$ are the components of the flow vector between the current frame $I_t$ and $I_{t+1}$, and $|\cdot|$ denotes the number of elements of the set.
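Given precomputed flow components (e.g., from any dense optical-flow routine), Eq. (4) reduces to a masked mean of the flow magnitude; a minimal sketch:

```python
import numpy as np

def movement_feature(u, v, mask):
    """Eq. (4): average optical-flow magnitude over the subject's mask pixels.
    u, v are per-pixel flow components between frames I_t and I_{t+1}."""
    mag = np.sqrt(u ** 2 + v ** 2)
    return mag[mask].sum() / mask.sum()
```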

III. DTW BASED ON ONE-CLASS CLASSIFIERS

The original DTW algorithm was defined to match temporal distortions between two models, finding an alignment/warping path between two time series $Q = \{q_1, \dots, q_n\}$ and $C = \{c_1, \dots, c_m\}$. In order to align these two sequences, an $m \times n$ matrix $M$ is designed, where the position $(i, j)$ of the matrix contains the alignment cost between $c_i$ and $q_j$. Then, a warping path of length $\tau$ is defined as a set of contiguous matrix elements defining a mapping between C and Q: $W = \{w_1, \dots, w_\tau\}$, where $w_i$ indexes a position in the cost matrix. This warping path is typically subject to several constraints.

Boundary conditions: $w_1 = (1, 1)$ and $w_\tau = (m, n)$.

Continuity and monotonicity: given $w_{\tau'-1} = (a', b')$, then $w_{\tau'} = (a, b)$ with $a - a' \le 1$ and $b - b' \le 1$. This condition forces the points in W to be monotonically spaced in time.

We are generally interested in the final warping path that, satisfying these conditions, minimizes the warping cost

$$\mathrm{DTW}(M) = \min_{W} \left\{ \frac{M(w_\tau)}{\tau} \right\} \tag{5}$$

where $\tau$ compensates for the different lengths of the warping paths. This path can be found very efficiently using dynamic programming. The cost at a certain position $M(i, j)$ can be found as the composition of the Euclidean distance $d(i, j)$ between the feature vectors of the sequences $c_i$ and $q_j$ and the minimum cost of the adjacent elements of the cost matrix up to that point, i.e., $M(i, j) = d(i, j) + \min\{M(i - 1, j - 1), M(i - 1, j), M(i, j - 1)\}$.
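The recurrence above can be sketched as follows. This is the standard DTW dynamic program with backtracking, not the one-class extension introduced later, and the helper name `dtw` is ours.

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(np.atleast_1d(np.asarray(a, float) - np.asarray(b, float))))

def dtw(C, Q, dist=euclidean):
    """Classic DTW between pattern C (length m) and sequence Q (length n).
    Returns the length-normalized cost of Eq. (5) and the warping path
    recovered by backtracking the dynamic-programming matrix M."""
    m, n = len(C), len(Q)
    M = np.full((m + 1, n + 1), np.inf)
    M[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = dist(C[i - 1], Q[j - 1])
            M[i, j] = d + min(M[i - 1, j - 1], M[i - 1, j], M[i, j - 1])
    # backtrack from (m, n) to (1, 1) to recover the warping path W
    path, i, j = [], m, n
    while (i, j) != (1, 1):
        path.append((i, j))
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda p: M[p])
    path.append((1, 1))
    path.reverse()
    return M[m, n] / len(path), path
```

Dividing the accumulated cost by the path length implements the normalization by τ in Eq. (5).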

Given the streaming nature of our problem, the input vector Q has no definite length and may contain several occurrences of a gesture class, namely C. At that point, the system considers that there is a correspondence between the current block k in Q and a gesture if the following condition is satisfied: $M(m, k) < \beta$, $k \in [1, \dots, \infty)$, for a given cost threshold β.

This threshold is estimated in advance using a leave-one-out cross-validation strategy on the training set. This involves using a single observation from the original sample as the validation data and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data. At each iteration, we evaluate the similarity value between the candidate and the rest of the training set. Finally, we choose the threshold value associated with the largest number of hits.
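A schematic version of this selection, assuming a `score(candidate, rest)` function that returns the DTW alignment cost of a held-out sample against the remaining training samples (both names are our own assumptions):

```python
def estimate_threshold(samples, score):
    """Leave-one-out sketch: each held-out sample yields an alignment cost,
    each cost is tried as the threshold beta, and we keep the beta that
    accepts ('hits') the most held-out samples."""
    costs = []
    for i, s in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        costs.append(score(s, rest))
    # pick the candidate beta maximizing the number of costs below it
    return max(costs, key=lambda beta: sum(c < beta for c in costs))
```

The paper does not specify the candidate set of thresholds; using the leave-one-out costs themselves is one plausible choice.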

Once the threshold is defined and a possible end of a gesture pattern is detected, the warping path W can be found through backtracking of the minimum path from M(m, k) to M(0, z), z being the instant of time in Q where the gesture begins. Note that d(i, j) is the cost function which measures the difference between our descriptors $c_i$ and $q_j$.

An example of begin-end gesture recognition together with the warping path estimation is shown in Fig. 5.

A. Handling Temporal Deformation in Sequences

Consider a training set of N sequences $\{S_1, S_2, \dots, S_N\}$, where all sequences belong to a certain gesture class. Then, each sequence $S_g$ is composed of a set of feature vectors at each time t, $S_g = \{s^g_1, \dots, s^g_{L_g}\}$, where $L_g$ is the length


Fig. 4. (a) Different sample sequences of a certain gesture category and the mean length sample. (b) Alignment of all samples with the mean length sample by means of Euclidean DTW. (c) Warped sequences set S from which each set of tth elements among all sequences are modeled. (d) GMM learning with three components.

in frames of sequence $S_g$. Let us assume that the sequences are ordered according to their length, so that $L_{g-1} \le L_g \le L_{g+1}, \forall g \in [2, \dots, N-1]$; the median length sequence is then $\bar{S} = S_{\lceil N/2 \rceil}$. The sequence with median length $\bar{S}$ is obtained from the training set of a certain behavioral pattern (e.g., head turn), and the rest of the sequences in the training set of that behavioral pattern are aligned with respect to the median length sequence (using standard DTW). After this process, all sequences in the training set of the same behavioral pattern have the same length, thus avoiding the temporal deformations of different samples from the same behavioral pattern. Therefore, after the alignment process, all sequences have length $L_{\lceil N/2 \rceil}$. We define the set of warped sequences as $\{S_1, S_2, \dots, S_N\}$.
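The alignment step can be sketched as below for one-dimensional features (the paper's features are vectors; the absolute difference stands in for the Euclidean distance, and `warp_to_reference` is our own helper name). Each reference frame t receives the last input frame that the DTW path maps to it, so every warped sequence ends up with the reference length.

```python
import numpy as np

def warp_to_reference(seq, ref):
    """Align seq to the median-length reference with standard DTW, then pick
    one frame of seq per reference frame so the result has len(ref) entries."""
    m, n = len(ref), len(seq)
    M = np.full((m + 1, n + 1), np.inf)
    M[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(ref[i - 1] - seq[j - 1])  # stand-in for Euclidean distance
            M[i, j] = d + min(M[i - 1, j - 1], M[i - 1, j], M[i, j - 1])
    # backtrack; the first visit to row i has the largest j mapped to it
    i, j, mapping = m, n, {}
    while True:
        mapping.setdefault(i, seq[j - 1])
        if (i, j) == (1, 1):
            break
        moves = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(moves, key=lambda p: M[p])
    return [mapping[t] for t in range(1, m + 1)]
```

Applying this to every training sequence with the median-length sample as `ref` yields the equal-length warped set described above.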

Once all samples are aligned, the feature vectors corresponding to a certain time t among all sequences, $s^g_t \; \forall g \in [1, \dots, N]$, are modeled by means of one-class classifiers (i.e., GMMs) in order to encode intraclass variability. An example of the process using GMMs is shown in Fig. 4.

B. Embedding One-Class Classifiers in DTW

In the classical DTW, a pattern and a sequence are aligned using a distance metric, such as the Euclidean distance. Since our pattern is modeled by means of one-class models, if we want to use the principles of DTW, the distance needs to be redefined. Next, we propose two cost distances, one based on GMMs and the other on approximated convex hulls.

1) GMMs: Following [3], we use GMMs to learn the features among all sequence samples (of a gesture category) at a certain time t, $s^g_t \; \forall g \in [1, \dots, N]$. Since after the alignment step all the sequences have the same length, $L_{\lceil N/2 \rceil}$, we learn $L_{\lceil N/2 \rceil}$ GMMs, one per component.

In this sense, a G-component Gaussian mixture model is defined as λ_t = {α_k^t, μ_k^t, Σ_k^t}, k = 1, . . . , G, where α is the mixing value and μ and Σ are the parameters of each of the G Gaussian models in the mixture. As a result, each one of the GMMs that model each set of tth components s_t among all warped sequence samples is defined as follows:

p(s_t) = Σ_{k=1}^{G} α_k · e^{−(1/2)(s_t − μ_k)^T Σ_k^{−1} (s_t − μ_k)}.   (6)

The resulting model is composed of a set of L_{⌈N/2⌉} GMMs corresponding to the modeling of each one of the component elements of the warped sequences s_t for each gesture pattern.

In this paper, we consider a soft-distance based on the probability of a point belonging to each one of the G components in the GMM, i.e., the posterior probability of q ∈ Q is obtained according to (6). In addition, since Σ_{k=1}^{G} α_k = 1, we can compute the probability of q belonging to the whole GMM λ as follows:

P_GMM(q, λ) = Σ_{k=1}^{G} α_k · P_k(q)   (7)

P_k(q) = e^{−(1/2)(q − μ_k)^T Σ_k^{−1} (q − μ_k)}   (8)

which is the sum of the weighted posterior probabilities of the components. However, an additional step is required, since the standard DTW algorithm is conceived for distances instead of similarity measures. In this sense, we use a soft-distance-based measure of the probability, which is defined as

D(q, λ) = e^{−P_GMM(q, λ)}.   (9)

An example of the use of the GMM framework to detect a given gesture is shown in Fig. 5.
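A minimal sketch of the per-frame GMM modeling and the soft distance of (7)-(9), assuming scikit-learn's `GaussianMixture` as the EM fitter (the paper does not specify an implementation); since (7)-(8) use unnormalized Gaussian kernels, we evaluate them directly from the fitted parameters:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_frame_gmms(warped, n_components=3):
    """Fit one GMM per time step t over the t-th frames of all N warped
    samples; `warped` has shape (N, L, d)."""
    return [GaussianMixture(n_components=n_components,
                            covariance_type='full',
                            random_state=0).fit(warped[:, t, :])
            for t in range(warped.shape[1])]

def soft_distance(q, gmm):
    """D(q, lambda) = exp(-P_GMM(q, lambda)), with the unnormalized
    Gaussian kernels of (7)-(8) read off the fitted parameters."""
    p_gmm = 0.0
    for alpha, mu, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        diff = q - mu
        p_gmm += alpha * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)
    return np.exp(-p_gmm)  # small when q is well explained by the mixture
```

Since 0 < P_GMM ≤ 1, the soft distance lies in [e^{−1}, 1): the cost is lowest where the mixture explains q well.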

2) Convex Hulls and Approximate Convex Polytope Decision Ensemble (APE): In addition to the use of GMMs as one-class classifiers, we also propose to use convex hulls to model the set of features s_t^g ∀g ∈ [1, . . . , N]. The underlying idea of convex hulls is to model the boundary of the set of points defining the problem. If the boundary encloses a convex area, then the convex hull, defined as the minimal convex set containing all the training points, provides a good general tool for modeling the target class, which in our case will be the set of features of all sequence samples at a certain time.

The convex hull of a set C ⊆ R^d, denoted conv C, is defined as the smallest convex set that contains C, i.e., the set of all convex combinations of points in C:

conv C = { θ_1 x_1 + · · · + θ_m x_m | x_i ∈ C, θ_i ≥ 0 ∀i, Σ_i θ_i = 1 }.   (10)
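The membership test implied by (10) can be posed as a small feasibility linear program: q ∈ conv C exactly when coefficients θ_i ≥ 0 exist with Σθ_i = 1 and Σθ_i x_i = q. A sketch (our own formulation for illustration, not part of the paper's method):

```python
import numpy as np
from scipy.optimize import linprog

def in_convex_hull(q, C):
    """Decide q in conv C per (10): does theta >= 0 exist with
    sum(theta) = 1 and C^T theta = q?  (Feasibility LP, zero objective.)"""
    q, C = np.asarray(q, float), np.asarray(C, float)
    m = C.shape[0]
    A_eq = np.vstack([C.T, np.ones(m)])  # convex-combination constraints
    b_eq = np.concatenate([q, [1.0]])
    res = linprog(np.zeros(m), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    return res.success
```

For example, with C the four corners of the unit square, the point (0.5, 0.5) is inside the hull while (2, 2) is not.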

In this scenario, the one-class classification task is reduced to the problem of knowing if test data lie inside or outside the hull. Although the convex hull provides a compact representation of the data, a small amount of outliers may lead to very different shapes of the convex polytope. Thus, a decision using these structures is prone to over-fitting. Casale et al. [5]


BAUTISTA et al.: GESTURE RECOGNITION SYSTEM FOR DETECTING BEHAVIORAL PATTERNS OF ADHD 141

Fig. 5. Begin-end of gesture recognition of a gesture pattern in an infinite sequence Q using the probability-based DTW. Note that different samples of the same gesture category are modeled with a GMM, and this model is used to provide a probability-based distance. In this sense, each cell of M will contain the accumulative D distance.

showed that it is useful to define a parameterized set of convex polytopes associated with the original convex hull of the training data. This set of polytopes consists of shrunk/enlarged versions of the original convex hull, governed by a parameter ϕ. The goal of this family of polytopes is to define the degree of robustness to outliers. The parameter ϕ defines a constant shrinking (−‖℘ − ς‖ ≤ ϕ ≤ 0) or enlargement (ϕ ≥ 0) of the convex structure with respect to the center ς. If ϕ = 0, then ℘_0 = conv C.

However, the creation of high-dimensional convex hulls is computationally intensive. In general, the cost of computing a d-dimensional convex hull on N data examples is O(N^{⌊d/2⌋+1}). This cost is prohibitive in time and memory, and, for the classification task, only checking whether a point lies inside the multidimensional structure is needed. Instead, we propose to use the APE of [5]. This method consists in approximating the decision made using the extended convex polytope in the original d-dimensional space by aggregating a set of F decisions made on low-dimensional random projections of the data, as shown in Fig. 6.

Since the projection matrix is created at random, the resulting space does not preserve the norm of the original space. Hence, a constant value of the parameter ϕ in the original space corresponds to a set of values ω_i in the projected one. As a result, the low-dimensional approximation of the expanded polytope is defined by the set of vertices as follows:

℘_ϕ : { ℘'_i + ω_i (℘'_i − ς')/‖℘'_i − ς'‖ }, i = 1, . . . , N   (11)

Fig. 6. Begin-end of gesture recognition of a gesture pattern in an infinite sequence Q using the probability-based DTW. In this example, APEs are used to model each set of ith frames.

where ς' = ρς represents the projected center, ℘'_i is the ith vertex of the convex hull of the projected data, and ω_i is defined as follows:

ω_i = [ (℘_i − ς)^T ρ^T ρ (℘_i − ς) / ‖℘_i − ς‖ ] ϕ   (12)

where ρ is the random projection matrix, ς is the center, and ℘_i is the ith vertex of the convex hull in the original space. Note that there exists a different expansion factor for each vertex ℘'_i belonging to the projected convex hull. Thus, we define an APE model as

ψ = { ℘_ϕ^f }   (13)

where f ∈ [1, . . . , F], and F is the total number of random projections used to approximate the original convex hull. To obtain the probability of a point belonging to the extended/shrunken convex polytope ensemble, we compute the proportion of low-dimensional random projections in which the testing point q lies inside the extended convex polytope. In this way, we get an approximate measure of how probable it is that the point lies inside the original convex hull. The proportion is calculated as follows:

P_APE(q, ψ) = (1/F) Σ_{f=1}^{F} 1[ q ∈ conv ℘_ϕ^f ].   (14)

Following the same scheme used with the GMM, we compute a soft distance based on the proportion of random projections in which the testing point q lies inside the extended convex polytope. This soft-distance is defined as follows:

D(q, ψ) = e^{−P_APE(q, ψ)}.   (15)

Finally, Table I shows the proposed DTW algorithm for begin-end gesture detection, where the distance D is computed from the APE models.
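A minimal sketch of the APE soft distance of (14)-(15), under simplifying assumptions of our own: ϕ = 0 (the vertex expansion of (11)-(12) is omitted), 2-D random projections, and scipy's Delaunay triangulation as the point-in-hull test; the function names are ours:

```python
import numpy as np
from scipy.spatial import Delaunay

def ape_fit(X, F=20, proj_dim=2, seed=0):
    """Approximate convex polytope ensemble with phi = 0 (no expansion):
    F random low-dimensional projections, one convex hull each, where each
    hull is stored as a Delaunay triangulation for fast point location."""
    rng = np.random.default_rng(seed)
    model = []
    for _ in range(F):
        rho = rng.standard_normal((proj_dim, X.shape[1]))  # projection matrix
        model.append((rho, Delaunay(X @ rho.T)))
    return model

def ape_soft_distance(q, model):
    """D(q, psi) = exp(-P_APE(q, psi)), with P_APE the fraction of
    projections whose hull contains the projected q, as in (14)-(15)."""
    inside = sum(int(tri.find_simplex((rho @ q)[None, :])[0] >= 0)
                 for rho, tri in model)
    return np.exp(-inside / len(model))
```

A point near the center of the training data lands inside every projected hull, so its soft distance reaches the minimum e^{−1}; points outside most projections approach 1.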


142 IEEE TRANSACTIONS ON CYBERNETICS, VOL. 46, NO. 1, JANUARY 2016

TABLE I
PROBABILITY-BASED DTW APPLIED TO BEGIN-END OF GESTURE RECOGNITION, USING APES AS BASE CLASSIFIERS

IV. EXPERIMENTAL RESULTS

In order to present the experimental results, we first introduce the data, methods, and evaluation measurements of the experiments.

A. ADHD Behavioral Patterns Dataset

In this section, we introduce the novel dataset on which the experiments are performed. The ADHD behavioral patterns dataset is composed of 18 video sequences in which both a group of three subjects diagnosed with ADHD and three control subjects are recorded in a school context, performing recreational and mathematical tasks. These video sequences were recorded using the Kinect sensor, which is able to obtain RGB and depth information at 24 frames/s. The features of the dataset are the following.

1) There is an equal proportion of video sequences of ADHD subjects and the control group.

2) There is an equal proportion of video sequences in which the subjects were performing recreational tasks and mathematical tasks.

3) The mean length of the video sequences was approximately 5 min.

4) Outlier events taking place during the recording sessions were manually filtered from the sequences.

For each one of the video sequences, a manual labeling process was performed in which two independent observers labeled the start and ending points of each one of the four behavioral patterns defined in Section II-A (head turn, torso in table, classmate desk invasion, and movement with/without pattern). The agreement of the labeling of the independent observers was measured with the well-known Cohen's Kappa coefficient for interannotator agreement [4]. In order to obtain this measure, we used the GSEQ software presented in [2]. Finally, the mean Cohen's Kappa statistic of the labeling

TABLE II
NUMBER OF SAMPLES PER SUBJECT AND BEHAVIORAL PATTERN

procedure was 0.93, which falls in the interval defined as almost perfect agreement in [13], and thus this labeling is used as the ground truth for evaluating the performance of the proposed methodologies. Table II shows a summary of the number of samples per subject and behavioral pattern. In addition, some samples of the ADHD behavioral pattern dataset are shown in Fig. 7. The overall number of frames in the dataset is approximately 50 000.

B. Methods

We compare the following methods, which have been proposed in this paper.

1) DTW Random: Aligning the streaming sequence Q with a sample selected randomly from the training set of gesture samples for a certain behavioral pattern, using the standard Euclidean distance.

2) DTW Mean: Aligning the streaming sequence Q with the mean of the samples S̃ for a certain behavioral pattern, also using the Euclidean distance. Note that we can compute the mean of a certain behavioral pattern since all samples have the same length after being aligned.

3) DTW GMM: The streaming sequence Q is aligned to a certain behavioral pattern by taking into account the probability of an element in Q belonging to a GMM which is learned using all the samples of such behavioral pattern, as proposed in Section III-B1.

4) DTW APE: The sequence Q is aligned to a certain gesture category by modeling the probability of an element in Q belonging to such behavioral pattern as the proportion of random projections in which the point lies inside a projected convex hull, as proposed in Section III-B2.

C. Evaluation Measurements

The evaluation measurements are overlapping and accuracy of recognition (in percentage). For the accuracy analysis, we consider that a gesture is correctly detected if the overlapping in the gesture sub-sequence is greater than 60% (the standard overlapping value [1]). The overlapping measure is defined as (g ∩ p)/(g ∪ p), where g is the ground truth and p the prediction. The cost threshold β for all methods was obtained by means of a stratified five-fold cross-validation, as well as the number of components in the GMM and the number of projections in the APE. In addition, we apply the Nemenyi post-hoc test [6] in order to look for statistical significance among the obtained performances.

Furthermore, to allow a deeper analysis of the proposed methodologies and their clinical impact, in our evaluations we use a "don't care" value which provides a more flexible interpretation of the results. Consider the ground truth of a


Fig. 7. (a) RGB image of the subjects diagnosed with ADHD performing mathematical tasks. (b) Depth information of ADHD subjects performing mathematical exercises. (c) RGB frame of an ADHD subject in the recreational task context. (d) Depth information of ADHD subjects. (e) RGB image of the control group. (f) Depth image of the control group.

Fig. 8. (a) Example of overlapping between a prediction and the ground truth. (b) Example where the don't care value is used to soften the overlap metric.

certain gesture category in a video sequence as a binary vector, which activates when a sample of such category is observed in the sequence. Then, the don't care value is defined as the number of bits (frames) which are ignored at the limits of each one of the ground truth instances. Thus, by using this approach we can compensate for the pessimistic overlap metric in situations when the detection is shifted by some frames. An example of this situation is shown in Fig. 8.
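The overlap measure with the don't care value can be sketched as follows, assuming binary per-frame ground-truth and prediction vectors and a symmetric band of ignored frames around every ground-truth boundary (this is our reading of the definition, not the authors' code):

```python
import numpy as np

def overlap(g, p, dont_care=0):
    """Jaccard overlap (g ∩ p)/(g ∪ p) between binary ground-truth and
    prediction vectors, ignoring `dont_care` frames on each side of every
    ground-truth begin/end boundary."""
    g, p = np.asarray(g, bool), np.asarray(p, bool)
    keep = np.ones(len(g), bool)
    edges = np.flatnonzero(np.diff(g.astype(int))) + 1  # instance boundaries
    for e in edges:
        keep[max(0, e - dont_care):e + dont_care] = False
    gi, pi = g[keep], p[keep]
    union = np.logical_or(gi, pi).sum()
    return np.logical_and(gi, pi).sum() / union if union else 1.0
```

For g = [0,0,1,1,1,1,0,0] and a one-frame-shifted prediction p = [0,0,0,1,1,1,1,0], the plain overlap is 3/5, while dont_care=1 raises it to 1.0.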

D. Experimental Results

Fig. 9 shows the overlapping and accuracy percentages of each one of the compared methods and for each one of the defined behavioral patterns.

In order to present a more compact and understandable version of the results, we selected specific don't care values and performed an analysis on those cases. In Tables III and IV, we show the overlapping and accuracy values per behavioral pattern and method for certain don't care values.

Finally, Table V shows the mean rank for each methodology and the final mean rank.

Once all the rankings are computed, in order to reject the null hypothesis that the measured performance ranks differ from the mean performance rank, and that the performance ranks are affected by randomness in the results, we use the Friedman test. Thus, with h = 4 methods to compare and U = 4 behavioral patterns × 4 don't care values (1, 50, 100, 150) × 2 metrics (overlapping and accuracy) = 32 experiments, the Friedman statistic is computed as follows, where V_j is the mean rank of the jth method:

X_F^2 = [12U / (h(h + 1))] [ Σ_j V_j^2 − h(h + 1)^2/4 ].   (16)

In our case, with h = 4 DTW methods to compare, X_F^2 = 14.8875. Since this value is undesirably conservative, Iman and Davenport [6] proposed a corrected statistic

F_F = (U − 1) X_F^2 / (U(h − 1) − X_F^2).   (17)

Applying this correction we obtain F_F = 5.68. With four methods and 32 experiments, F_F is distributed according to the F distribution with 3 and 93 degrees of freedom. The critical value of F(3, 93) for α = 0.05 is approximately 2.70. As the value of F_F is higher than the critical value, we can reject the null hypothesis.
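The Friedman statistic (16) and the Iman-Davenport correction (17) are straightforward to compute; a sketch (the example ranks below are illustrative, not the paper's):

```python
def friedman_x2(mean_ranks, U):
    """Friedman chi-square statistic of (16), from the per-method mean
    ranks V_j over U experiments, with h = number of methods."""
    h = len(mean_ranks)
    return 12 * U / (h * (h + 1)) * (sum(v * v for v in mean_ranks)
                                     - h * (h + 1) ** 2 / 4)

def iman_davenport(x2f, U, h):
    """Iman-Davenport corrected statistic of (17)."""
    return (U - 1) * x2f / (U * (h - 1) - x2f)
```

With the paper's X_F^2 = 14.8875, U = 32 and h = 4, the correction yields about 5.69, consistent with the reported F_F = 5.68 up to rounding.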

Furthermore, we perform a Nemenyi test in order to check if any of these methods can be singled out [6]. The Nemenyi statistic is obtained as follows:

CD = q_α √( h(h + 1) / (6U) ).   (18)
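The critical difference of (18) is a one-line computation; plugging in q_0.95 = 2.569, h = 4 and U = 32 reproduces the CD_0.95 ≈ 0.8291 used in the analysis:

```python
import math

def nemenyi_cd(q_alpha, h, U):
    """Critical difference of (18): two methods differ significantly when
    their mean ranks differ by more than CD."""
    return q_alpha * math.sqrt(h * (h + 1) / (6 * U))
```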


Fig. 9. (a) Overlapping metric per method and number of don't care bits for the head turn behavioral pattern. (b) Accuracy value for each one of the compared methods and number of don't care bits for the head turn pattern. (c) Overlapping metric for each method and number of don't care bits for the torso in table pattern. (d) Accuracy metric per method and number of don't care bits for the torso in table behavioral pattern. (e) Overlapping metric and number of don't care bits for the classmate desk invasion behavioral pattern. (f) Accuracy value per compared method and number of don't care bits for the classmate desk invasion behavioral pattern. (g) Overlapping metric per method and number of don't care bits for the movement pattern. (h) Accuracy value and number of don't care bits for the movement behavioral pattern.

In our case, for h = 4 DTW methods to compare and U = 32 experiments, the critical difference value (CD) for a 95% confidence level is CD_0.95 = 2.569 · √(20/192) = 0.8291. As a result, none of the CD intervals of the standard DTW methods intersects with those of our proposals DTW GMM and DTW APE, the latter being the best in mean ranking. These results are highly desirable since


TABLE III
PERFORMANCE OF THE COMPARED METHODOLOGIES IN TERMS OF OVERLAPPING

TABLE IV
PERFORMANCE OF THE COMPARED METHODOLOGIES BASED ON THE ACCURACY METRIC

TABLE V
MEAN RANKS FOR EACH METHOD AND CERTAIN DON'T CARE VALUES

they support the fact that the proposed methodologies obtain a statistically significant improvement in performance when compared to standard DTW approaches. For completeness, we also compute CD_0.90 and CD_0.75; the results are shown in Fig. 10.

These results support the fact that our proposed DTW APE is statistically better than the standard DTW approaches, since the CD for a 95% confidence level is smaller than the difference in ranking between the proposed DTW APE method and standard DTW, thus obtaining very satisfying results while keeping similar

Fig. 10. Mean rank per method and CD interval for different confidence values.

computational complexity. In addition, though our contribution can be applied to any general purpose gesture recognition system, from a clinical point of view, the presented analyses were reported as relevant by physicians involved in the project and specialists on ADHD from hospitals in the area of Catalonia.

V. CONCLUSION

In this paper, we presented an extension of the DTW algorithm in order to handle the intraclass variability of a gesture class. This variability was encoded using one-class classifiers, such as GMMs and APEs. In order to embed these classifiers in the DTW context, the association cost was redefined to take into account the properties of such classifiers. We applied this extension to a real-world problem, detecting ADHD behavioral patterns to support clinicians in the diagnosis procedure. In our experiments on a novel multimodal ADHD dataset, the proposed methodology obtained statistically significant improvements with respect to standard DTW techniques while obtaining relevant classification rates from a clinical point of view.

The results of this paper motivate the use of the proposed techniques with a much broader set of ADHD behavioral patterns in order to provide additional information to the clinician. Moreover, the presented methodology represents a significant contribution for general-purpose human behavior analysis systems.

REFERENCES

[1] K. Mikolajczyk et al., "A comparison of affine region detectors," Int. J. Comput. Vis., vol. 65, nos. 1–2, pp. 43–72, 2005.

[2] R. Bakeman and V. Quera, Sequential Analysis and Observational Methods for the Behavioral Sciences. New York, NY, USA: Cambridge Univ. Press, 2011.

[3] M. Bautista et al., "Probability-based dynamic time warping for gesture recognition on RGB-D data," in Proc. Int. Conf. Pattern Recognit. Workshops (WDIA), Tsukuba, Japan, 2012, pp. 126–135.

[4] J. Carletta, "Squibs and discussions: Assessing agreement on classification tasks: The Kappa statistic," Comput. Linguist., vol. 22, no. 2, pp. 249–254, 1996.

[5] P. Casale, O. Pujol, and P. Radeva, "Approximate convex hulls family for one-class classification," in Multiple Classifier Systems. Berlin, Germany: Springer, 2011, pp. 106–115.

[6] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1–30, Jan. 2006.

[7] S. Escalera, A. Fornés, O. Pujol, J. Lladós, and P. Radeva, "Circular blurred shape model for multiclass symbol recognition," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 41, no. 2, pp. 497–506, Apr. 2011.

[8] R. Girshick, J. Shotton, P. Kohli, A. Criminisi, and A. Fitzgibbon, "Efficient regression of general-activity human poses from depth images," in Proc. Int. Conf. Comput. Vis. (ICCV), Barcelona, Spain, 2011, pp. 415–422.

[9] D. Gong and G. Medioni, "Dynamic manifold warping for view invariant action recognition," in Proc. Int. Conf. Comput. Vis. (ICCV), Barcelona, Spain, 2011, pp. 571–578.

[10] A. Hampapur et al., "Smart video surveillance: Exploring the concept of multiscale spatiotemporal tracking," IEEE Signal Process. Mag., vol. 22, no. 2, pp. 38–51, Mar. 2005.

[11] J. Hashemi et al., "Computer vision tools for the non-invasive assessment of autism-related behavioral markers," arXiv preprint arXiv:1210.7014, 2012.

[12] A. Hernández-Vela, M. Reyes, V. Ponce, and S. Escalera, "GrabCut-based human segmentation in video sequences," Sensors, vol. 12, no. 11, pp. 15376–15393, 2012.

[13] J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, vol. 33, no. 1, pp. 159–174, 1977.

[14] J. López-Ibor, CIE-10: Trastornos Mentales y del Comportamiento. Madrid, Spain: Meditor, 1992.

[15] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. 7th Int. Joint Conf. Artif. Intell., vol. 2. San Francisco, CA, USA, 1981, pp. 674–679.

[16] J. Nair, U. Ehimare, B. Beitman, S. Nair, and A. Lavin, "Clinical review: Evidence-based diagnosis and treatment of ADHD in children," Missouri Med., vol. 103, no. 6, p. 617, 2006.

[17] M. A. Nicolaou, V. Pavlovic, and M. Pantic, "Dynamic probabilistic CCA for analysis of affective behavior," in Computer Vision—ECCV 2012. Berlin, Germany: Springer, pp. 98–111.

[18] DSM-IV Draft Criteria, Amer. Psychiatr. Assoc. Task Force DSM-IV, Washington, DC, USA, 1993.

[19] S. E. O. Lopes, M. Reyes, and J. González, "Spherical blurred shape model for 3-D object and pose recognition: Quantitative analysis and HCI applications in smart environments," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 44, no. 12, pp. 2379–2390, Dec. 2014.

[20] N. Otsu, "A threshold selection method from gray-level histograms," Automatica, vol. 11, nos. 285–296, pp. 23–27, 1975.

[21] A. Pentland, "Socially aware computation and communication," Computer, vol. 38, no. 3, pp. 33–40, Mar. 2005.

[22] R. Benavente, M. Vanrell, and R. Baldrich, "A data set for fuzzy color naming," Color Res. Appl., vol. 31, no. 1, pp. 48–56, 2006.

[23] M. Reyes, G. Dominguez, and S. Escalera, "Feature weighting in dynamic time warping for gesture recognition in depth data," in Proc. Int. Conf. Comput. Vis. Workshops (ICCV), Barcelona, Spain, 2011, pp. 1182–1188.

[24] C. Rother, V. Kolmogorov, and A. Blake, "GrabCut: Interactive foreground extraction using iterated graph cuts," in Proc. ACM SIGGRAPH Papers, Los Angeles, CA, USA, 2004, pp. 309–314.

[25] J. Shotton et al., "Real-time human pose recognition in parts from single depth images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Washington, DC, USA, 2011, pp. 1297–1304.

[26] M. Svensen and C. M. Bishop, "Robust Bayesian mixture modeling," in Proc. 13th Eur. Symp. Artif. Neural Netw. (ESANN), Bruges, Belgium, 2005, pp. 235–252.

[27] B. Yang, J. Cui, H. Zha, and H. Aghajan, "Visual context based infant activity analysis," in Proc. 6th Int. Conf. Distrib. Smart Cameras (ICDSC), Hong Kong, 2012, pp. 1–6.

[28] H.-D. Yang, S. Sclaroff, and S.-W. Lee, "Sign language spotting with a threshold model based on conditional random fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 7, pp. 1264–1277, Jul. 2009.

[29] F. Zhou and F. De la Torre, "Generalized time warping for multi-modal alignment of human motion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Providence, RI, USA, 2012, pp. 1282–1289.

[30] F. Zhou, F. De la Torre, and J. K. Hodgins, "Hierarchical aligned cluster analysis for temporal clustering of human motion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 3, pp. 582–596, Mar. 2013.

Miguel Ángel Bautista received the B.Sc. degree in computer science from the Universitat de Barcelona, Barcelona, Spain, and the M.Sc. degree in artificial intelligence from the Universitat Politècnica de Catalunya, Barcelona, in 2010. He is currently pursuing the Ph.D. degree in error correcting output codes as a theoretical framework to treat multiclass and multilabel problems.

He is a Research Member with the Computer Vision Center, Universitat Autònoma de Barcelona, Barcelona, the Applied Mathematics and Analysis Department, Universitat de Barcelona, and the BCN Perceptual Computing Laboratory and Human Pose Recovery and Behavior Analysis Group, University of Barcelona. His current research interests include machine learning, computer vision, and convex optimization and its applications to human gesture analysis.

Mr. Bautista was the recipient of the First Prize from the Catalan Association of Artificial Intelligence Thesis Awards in 2010.

Antonio Hernández-Vela received the bachelor's degree in computer science and the M.S. degree in computer vision and artificial intelligence, both from the Universitat Autònoma de Barcelona (UAB), Barcelona, Spain, in 2009 and 2010, respectively. He is currently pursuing the Ph.D. degree at the University of Barcelona, Barcelona.

He is currently a Research Member with the Computer Vision Center, UAB. His current research interests include the application of computer vision and artificial intelligence techniques to projects that can help impaired people improve their quality of life, especially in the area of human pose recovery and behavior analysis.

Mr. Hernández-Vela is a member of the BCN Perceptual Computing Laboratory Research Group and the Human Pose Recovery and Behavior Analysis Group, University of Barcelona.

Sergio Escalera received the B.S. and M.S. degrees from the Universitat Autònoma de Barcelona (UAB), Barcelona, Spain, in 2003 and 2005, respectively, and the Ph.D. degree in multiclass visual categorization systems from the Computer Vision Center, UAB.

He has led the Human Pose Recovery and Behavior Analysis Group, University of Barcelona, Barcelona. His current research interests include machine learning, statistical pattern recognition, visual object recognition, and human computer interaction systems, with special interest in human pose recovery and behavior analysis.

Mr. Escalera was the recipient of the 2008 Best Thesis Award on Computer Science, UAB.

Laura Igual received the degree in mathematics from the Universitat de Valencia, Valencia, Spain, in 2000, and the Ph.D. degree in computer science and digital communication from the Department of Technology, Universitat Pompeu Fabra, Barcelona, Spain, in 2006.

Since 2006, she has been a Research Member at the Computer Vision Center of Barcelona, and since 2009, she has been a Lecturer at the Department of Applied Mathematics and Analysis, Universitat de Barcelona, Barcelona. Her current research interests include medical imaging with a focus on neuroimaging, computer vision, machine learning, and mathematical models and variational methods for image processing.

Dr. Igual is a member of the Perceptual Computing Laboratory and of a Consolidated Research Group of Catalonia.


Oriol Pujol received the degree in telecommunications engineering from the Universitat Politècnica de Catalunya, Barcelona, Spain, in 1998, and the Ph.D. degree in computer science from the Universitat Autònoma de Barcelona (UAB), in 2004, where he specialized in deformable models, fusion of supervised and unsupervised learning, and intravascular ultrasound image analysis.

In 1998, he joined the Computer Vision Center and the Computer Science Department, UAB. He became an Associate Professor at the Department of Matemàtica Aplicada i Anàlisi, Universitat de Barcelona, Barcelona, in 2005.

Dr. Pujol is a member of the BCN Perceptual Computing Laboratory. Since 2004, he has been an active member in the organization of several activities related to image analysis, computer vision, machine learning, and artificial intelligence.

Josep Moya received the Doctor degree in medicine; he is a psychiatrist and psychoanalyst.

He is with the Mental Health Department, Parc Taulí, Barcelona, Spain. He is also the Leader of the Observatory of Communitarian Mental Health of Catalonia, and a Teacher with the Department of Social Wellness and Family of the Generalitat de Catalunya, Barcelona, and with the Center for Legal Studies and Specialized Training of the Department of Justice of the Generalitat de Catalunya. He is the President of the Private Catalan Foundation for Research and Evaluation of Psychoanalytic Practice. He leads a research project on the impact of the economic crisis on the mental health of the population. He has published several articles on attention deficit hyperactivity disorder.

Verónica Violant received the Ph.D. degree in psychology from Ramon Llull University, Barcelona, Spain.

She is a Tenured Professor with the Didactic and Educational Organization Department, University of Barcelona, Barcelona, where she leads the graduate course on pedagogics, childhood, and disease. Her current research interests include hospital pedagogics, paediatrics, and neonatology. She has authored various publications on attentiveness to diseases in childhood and youth.

Dr. Violant was the recipient of the Diamond Prize of Research from the International Awards of the Eureka Sciences in 2012. She is a member of the Research Group for Socio-Educational Interventions in Childhood and Youth.

María T. Anguera received the degree in law from the University of Barcelona, Barcelona, Spain, and the Ph.D. degree in philosophy and humanities (psychological section).

Since 1986, she has been a Distinguished Professor at the Department of Behavioral Science Methodologies, University of Barcelona, where she has a long teaching trajectory together with several research participations at foreign universities. She was the Vice-Rector of Scientific Politics at the University of Barcelona. She has also advised several Ph.D. dissertations and has published over 100 journal papers on psychology.

Dr. Anguera has been a member of the Steering Doctorate Committee since 2011. She is an Academic with the Spanish Royal Academy of Medicine, Madrid, Spain.

