

Context-based hand gesture recognition for the operating room

Mithun George Jacob, Juan Pablo Wachs ⇑
School of Industrial Engineering, Purdue University, West Lafayette, IN 47906, USA

Pattern Recognition Letters xxx (2013) xxx–xxx
0167-8655/$ - see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.patrec.2013.05.024

⇑ Corresponding author. Tel.: +1 765 496 7380.
E-mail addresses: [email protected] (M.G. Jacob), [email protected] (J.P. Wachs).

Article history: Available online xxxx
Communicated by Luis Gomez Deniz

Keywords: Continuous gesture recognition; Operating room; Human computer interaction

Abstract

A sterile, intuitive, context-integrated system for navigating MRIs through freehand gestures during a neurobiopsy procedure is presented. Contextual cues are used to determine the intent of the user, to improve continuous gesture recognition, and to support the discovery and exploration of MRIs. One of the challenges of gesture interaction in the operating room is discriminating between intentional and non-intentional gestures; this problem is also referred to as spotting. In this paper, a novel method for training gesture spotting networks is presented. The continuous gesture recognition system was shown to successfully detect gestures 92.26% of the time with a reliability of 89.97%. Experimental results show that significant improvements in task completion time were obtained through the effect of context integration.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Due to advances in computer-assisted surgery, human–computer interaction (HCI) in the operating room (OR) is gradually becoming commonplace. Several surgical procedures such as tumor resections mandate the use of computers (WHO, 2009) intra-operatively and during pre-operative planning. Since HCI devices are possible sources of contamination due to the difficulty of sterilization, clinical protocols have been devised to delegate control of the terminal to a sterile human assistant (Albu, 2006; Schultz et al., 2003; Oliveira et al., 2012). However, this mode of communication has been shown to be cumbersome (Maintz and Viergever, 1998) and prone to errors (Albu, 2006), and it therefore increases the overall duration of the procedure. As a secondary effect, such indirect interaction could increase the surgeon's cognitive load (Firth-Cozens, 2004; Halverson et al., 2011; Lingard et al., 2004), which highlights the need for a sterile method of HCI in the operating room.

Computer systems used to navigate MRI images before and during surgery (PACS) (Boochever, 2004; Lemke and Berliner, 2011; Mulvaney, 2002) conventionally require the use of keyboards, mice or touchscreens for MRI browsing. This paper proposes a sterile method for the surgeon to naturally and efficiently manipulate MRI images through touchless, freehand gestures (Gallo et al., 2011; Ebert et al., 2011; Johnson et al., 2011; Mentis et al., 2012; Micire et al., 2009).

Image manipulation through gestural devices has been shown to be natural and intuitive (Ebert et al., 2011; Hauptmann, 1989) and does not compromise the sterility of the surgeon. One example, a touchless mouse (Gratzel et al., 2004), utilizes stereo vision to localize the hand in 3D, which allows the user to control the interface with hand gestures. A multimodal solution for obtaining patient input using gestures (Keskin et al., 2007) has also been shown to be effective. Systems based on voice recognition have also been utilized in the OR, such as AESOP, a voice-controlled robotic arm which handles a camera during surgery (Mettler et al., 1998). The main drawbacks of voice recognition systems are long reaction times, erratic responses and user dependency (Nishikawa et al., 2003). The uncontrolled and noisy environment characteristic of the OR has led to the development of gesture-based interfaces for the operating room (Jacob, 2011; Jacob et al., 2012).

The need for sterile image manipulation has led to the development of touchless HCI based on the use of facial expressions (Nishikawa et al., 2003), hand and body gestures (Gratzel et al., 2004; Keskin et al., 2007; Zeng et al., 1997; Grange et al., 2004) and gaze (Nishikawa et al., 2003; Yanagihara and Hama, 2000). It should be noted that none of these studies has incorporated surgical contextual cues to disambiguate the recognition of false gestures and improve gesture recognition performance.

An alternate modality is proposed to replace HCI devices such as the keyboard, mouse, and touch-screens traditionally used to navigate and manipulate a sequential set of MRI images. The proposed system extends the work previously developed by the authors (Wachs et al., 2008; Jacob et al., 2012) with the use of dynamic two-handed gestures and contextual knowledge. Additionally, the authors provide a novel, analytical method to optimize the gesture recognition system using a priori data and introduce a new contextual cue for improved navigational performance.

2. System overview

2.1. MRI image browser

An MRI browser, inspired by the OsiriX system (Rosset et al., 2004), was developed with OpenGL (Shreiner et al., 2005) and OpenCV (Bradski, 2000) for the navigation and manipulation of MRI images. Several MRI sequences and slices within sequences are displayed in the browser for selection, navigation and manipulation. The browser supports a set of 10 commands typically used during surgery to manipulate and revise MRI images. The commands and corresponding gestures were obtained by consulting nine veterinary surgeons who are accustomed to working with MRI browsing software. Additionally, the image browser can also be operated through the keyboard/mouse.

The lexicon consists of ten gestures (see Fig. 1(a)) encompassing image manipulation tasks such as zooming (zoom in and zoom out), rotation (clockwise and counter-clockwise) and brightness change (brightness up and brightness down), as well as image navigation tasks such as browsing (up, down, left, and right).

2.2. Intent and gesture classification

Anthropometric information of the user was obtained through a Microsoft Kinect (Kinect – Xbox.com, 2012) using the OpenNI SDK (OpenNI, 2012). This includes the 3D coordinates of the left and right shoulders, the head, and both hands (see Fig. 1(b)). The positions of the shoulders and head were used to classify the pose of the user (the user is assumed to make intentional gestures when facing the system). The positions of both hands are used to obtain the trajectory of the hands over time (only hand positions during an "intentional" pose are recorded). These cues are henceforth referred to as visual contextual cues.

Fig. 1. (a) Gesture lexicon (b) skeleton model and tracked marker-less points.


Non-visual contextual cues include the history of commands performed during the user interaction as well as the time between successive commands. Another important non-visual cue is inferred from the biopsy procedure. The tip of the biopsy needle is tracked as it penetrates the patient's brain tissue (see Fig. 2). Since the MRI images of interest are usually related to the position of the needle tip inside the brain, this location is mapped to a sequence and a slice from the available set of MRI images. The MRI image browser uses this information to display the mapped slice, saving the user from executing several navigational commands to reach the slice.

In our previous work (Jacob et al., 2012), we established that the inclusion of context significantly reduced the false positive rate of gesture recognition. The following sections briefly describe the integration of contextual cues within the recognition framework.

3. Intent classification through visual context

It has been shown that gaze is a critical contextual cue used to determine the intent of a user to interact with an entity (Emery, 2000). Other important anthropometric cues (Langton, 2000) such as torso orientation have also been used to aid in intent recognition.

3.1. Torso orientation (Th)

The orientation of the torso is computed using the skeletal joint coordinates in 3D. The coordinates of the left and right shoulders were used to compute the azimuth orientation Th w.r.t. the X-axis. A pose is counted as intentional if the user faces the Kinect sensor (i.e. Th is approximately 180°), which resembles face-to-face interpersonal interaction.
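A minimal sketch of this intent check is shown below, assuming the shoulder joints are available as 3D points from the skeleton tracker; the axis convention and the ±15° tolerance are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def torso_azimuth_deg(left_shoulder, right_shoulder):
    """Azimuth of the shoulder line w.r.t. the sensor's X-axis, in degrees.

    Both arguments are 3D points (x, y, z) reported by the skeleton tracker.
    """
    v = np.asarray(right_shoulder, float) - np.asarray(left_shoulder, float)
    # Project onto the horizontal X-Z plane and measure the angle from the X-axis.
    return np.degrees(np.arctan2(v[2], v[0])) % 360.0

def is_intentional_pose(theta_deg, tolerance_deg=15.0):
    """A pose is counted as intentional when the torso azimuth is close to 180 degrees."""
    return abs(theta_deg - 180.0) <= tolerance_deg
```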

3.2. Head orientation (H)

The skeletal coordinates of the user's body are used to reduce the search space for the head. The Viola–Jones frontal face detector (Viola and Jones, 2004) is used to compute the location of the head in the reduced search space. Since the aforementioned detector is a frontal face classifier, the mean H of the binary responses of the detector (detected/not detected) is computed over a window of W = 10 frames. A high value of H corresponds to a high degree of confidence that the head is oriented towards the sensor.
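The head-orientation cue can be sketched with OpenCV's stock Viola–Jones frontal-face cascade, as below; cropping the search region from the skeleton and the detector parameters shown are assumptions made for illustration.

```python
from collections import deque
import cv2

# Frontal-face Haar cascade shipped with OpenCV (Viola-Jones detector).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

W = 10                       # window length used in the paper
responses = deque(maxlen=W)  # binary detection responses over the last W frames

def head_orientation_score(gray_head_roi):
    """Mean H of the binary frontal-face responses over the last W frames.

    gray_head_roi: grayscale image region around the head, obtained by reducing
    the search space with the skeletal coordinates.
    """
    faces = face_cascade.detectMultiScale(gray_head_roi, scaleFactor=1.1, minNeighbors=5)
    responses.append(1.0 if len(faces) > 0 else 0.0)
    return sum(responses) / len(responses)  # high H => head oriented towards the sensor
```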

Fig. 2. Experimental setup with the mock biopsy needle inside the model’s head.


3.3. Integration of visual contextual cues

A dataset of 125,846 intentional and unintentional poses was collected (see Section 6.3) and manually annotated. The cues Th and H were computed for each pose and used to train a pruned decision tree with 10-fold cross validation. The pruning level l was varied to determine the optimal tree for intent classification (Jacob et al., 2012).
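A rough equivalent of this training step is sketched below, assuming scikit-learn in place of the original toolchain; cost-complexity pruning (ccp_alpha) stands in for the paper's pruning level l, which is an assumed analogue rather than the same parameter.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def select_pruned_tree(X, y, ccp_alphas):
    """10-fold cross-validation over a pruning parameter for the intent classifier.

    X: one row per pose with the cues [Th, H]; y: 1 for intentional, 0 otherwise.
    """
    best_alpha, best_score = None, -np.inf
    for alpha in ccp_alphas:
        clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
        score = cross_val_score(clf, X, y, cv=10).mean()
        if score > best_score:
            best_alpha, best_score = alpha, score
    # Refit the best pruned tree on all the annotated poses.
    return DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```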

4. Continuous gesture recognition with Hidden Markov Models (HMMs)

HMMs have been a popular choice for gesture recognition (Lee and Kim, 1999; Elmezain et al., 2009; Bernardin et al., 2005; Junker et al., 2008; Malgireddy et al., 2010, 2012) since they readily model sequential data over time. Broadly speaking, HMMs can be used to efficiently solve two problems in gesture recognition: isolated and continuous gesture recognition. Isolated gesture recognition usually classifies a segmented gesture, whereas continuous gesture recognition attempts to segment and classify a gesture from a stream of data.

4.1. Isolated gesture recognition

Given a lexicon of gestures and quantized, segmented samples of each gesture, a single N-state HMM λ = {π, A, B} (see Fig. 3) is trained per gesture using the Baum–Welch algorithm (Rabiner, 1989), where π is the initial state distribution, A is the state transition matrix and B is the observation symbol probability distribution over a discrete set of symbols V. Then, given a sequence of discrete symbols O (obtained from quantizing the trajectory of the hands), the Viterbi algorithm (Rabiner, 1989) is used to compute $\lambda_G = \arg\max_{\lambda \in \Lambda} P(O|\lambda)$ and thus determine the corresponding gesture.
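For concreteness, a minimal sketch of scoring a quantized observation sequence against the trained per-gesture models is shown below. It uses the scaled forward algorithm to compute P(O|λ) (the Viterbi score used in the paper can be substituted) and assumes the (π, A, B) parameters have already been estimated with Baum–Welch.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """log P(O | lambda) for a discrete HMM lambda = (pi, A, B), scaled forward algorithm.

    obs: sequence of observation symbol indices; pi: (N,); A: (N, N); B: (N, M).
    """
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_i(o_1)
    c = alpha.sum()
    log_p = np.log(c)
    alpha = alpha / c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction step
        c = alpha.sum()
        log_p += np.log(c)
        alpha = alpha / c
    return log_p

def classify_isolated(obs, models):
    """argmax over the lexicon of P(O | lambda); models maps gesture name -> (pi, A, B)."""
    return max(models, key=lambda g: log_likelihood(obs, *models[g]))
```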

The following quantization scheme is used. Let $x_{n,l}, x_{n,r} \in \mathbb{R}^3$ be the real-world 3D positions of the left and right hand, respectively, in frame n. The velocity of a hand $h \in \{l, r\}$ in frame n is given by $v_{n,h} = x_{n,h} - x_{n-1,h}$. Each element of $v_{n,h}$ is quantized to a symbol in {−, 0, +} using (1), where u is a velocity component along one of the x, y, z axes. Note that the quantization threshold $\tau \ge 0$ is used to determine whether the hand is stationary along an axis.

$$S(u) = \begin{cases} - & u \le -\tau \\ + & u \ge \tau \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

Therefore, the continuous velocity vectors of the two hands, $v_{n,l}$ and $v_{n,r}$, can be concatenated and quantized into a set of discrete symbols. Since this set has only $3^6 = 729$ elements, the velocity of both hands at a frame n can be uniquely identified by a single discrete symbol. Note that these $3^6$ symbols form the observation symbol set V (Rabiner, 1989) for each HMM.
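A compact sketch of this quantization is shown below, assuming the tracker reports hand positions in millimetres so that the 25 mm threshold of Section 6.1 applies directly.

```python
import numpy as np

def quantize_symbol(v_left, v_right, tau=25.0):
    """Map the per-frame velocities of both hands to one of 3**6 = 729 discrete symbols.

    v_left, v_right: 3D velocity vectors x_n - x_{n-1} (same units as the tracker);
    tau: quantization threshold (25 mm in the experiments).
    """
    v = np.concatenate([np.asarray(v_left, float), np.asarray(v_right, float)])
    # Each of the six components -> {-, 0, +}, encoded here as {0, 1, 2}.
    digits = np.where(v <= -tau, 0, np.where(v >= tau, 2, 1))
    # Read the six ternary digits as a single base-3 number in [0, 728].
    return int(np.dot(digits, 3 ** np.arange(6)))
```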

Since $\lambda_G = \arg\max_{\lambda \in \Lambda} P(O|\lambda)$, it is apparent that even if O does not belong to a gesture (i.e. it is a non-gestural movement), a $\lambda_G$ (a gesture class) will still be selected. Additionally, a simple threshold on $P(O_G|\lambda)$ is insufficient (Lee and Kim, 1999), which has led to the development of threshold models (an additional HMM) that provide an adaptive threshold used to discriminate between gestures and non-gestures.

Fig. 3. 5-state left–right (or Bakis) absorbing HMM.


4.2. Gesture spotting networks

The independently trained HMM models (one per gesture) λ ∈ Λ are combined (see Fig. 4) with the threshold model λ_T into a gesture spotting network (GSN) used for continuous gesture recognition. Since the Viterbi algorithm can be used to provide the most probable state sequence corresponding to a sequence of symbols O, symbols corresponding to a non-gesture O_T are matched with states in the threshold model λ_T, and symbols corresponding to a gesture O_G are matched with a gesture model λ_G ∈ Λ.

The threshold model λ_T utilizes as many states as all the gesture models put together. In this paper, a gesture model λ_k = {π_k, A_k, B_k} has five states, and since there are 10 gestures in the lexicon, λ_T has N_T = 50 states. The observation probability distribution of each state in the threshold model is the same as that of its gesture-model counterpart. The initial and state transition probability distributions are uniform across all states (the model is fully connected), such that A_T(i, j) = 1/N_T. Note that when the threshold model is incorporated into the network, the transition probability from the start state S to any threshold state is P_T and the inter-state transition probability within the threshold model is equal to 1/(N_T + 1).

Similarly, since the gesture models are absorbing Markov chains, the self-transition probability of the terminal state is 1 and the initial state distribution is π_k = [1 0 0 0 0]^T. When incorporated into the GSN, the self-transition probability of the terminal state is set to P_A ≈ 1 (since the probability of reaching the S state from the terminal state is non-zero), and the transition probability from S to the first state of each gesture model is P_G.
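A sketch of assembling the stand-alone threshold model from the trained gesture models is given below; wiring the start state S and the probabilities P_G, P_T and P_A into the full network is left out.

```python
import numpy as np

def build_threshold_model(gesture_models):
    """Threshold model: one state per gesture-model state, same observation
    distributions, uniform (fully connected) transitions.

    gesture_models: list of (pi, A, B) tuples with B of shape (n_states, n_symbols).
    """
    B_T = np.vstack([B for _, _, B in gesture_models])   # copy the emission rows
    N_T = B_T.shape[0]                                    # e.g. 10 gestures x 5 states = 50
    A_T = np.full((N_T, N_T), 1.0 / N_T)                  # uniform inter-state transitions
    pi_T = np.full(N_T, 1.0 / N_T)
    return pi_T, A_T, B_T
```

Once the threshold model is embedded in the network, the inter-state transition mass drops to 1/(N_T + 1) because every threshold state also connects back to S, as noted above.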

The GSN is optimized (Lee and Kim, 1999) by choosing P_G and P_T such that

$$P_G\,P(O_G|\lambda_G) > P_T\,P(O_G|\lambda_T) \qquad (2)$$

$$P_T\,P(O_T|\lambda_T) > P_G\,P(O_T|\lambda_G) \qquad (3)$$

Condition (2) ensures that it is more probable for a Viterbi sequence to pass through a gesture model (λ_G) than the threshold model (λ_T) given a gestural sequence O_G. Similarly, condition (3) ensures that for a non-gestural sequence O_T, a Viterbi sequence which passes through λ_T is preferred over λ_G. The probabilities P(O_G|λ_G), P(O_G|λ_T), P(O_T|λ_T) and P(O_T|λ_G) are computed from manually annotated sequences of user interaction. Each annotated sequence O_G and O_T provides one constraint corresponding to (2) and (3), respectively. Therefore, it would be advantageous to determine values of P_G, P_T which satisfy as many of these constraints as possible (corresponding to correct classification of segments in the annotated sequences as gesture or non-gesture).

Prior work on GSNs determines P_G, P_T exhaustively (Lee and Kim, 1999; Kelly et al., 2009) or through ad hoc methods (Elmezain et al., 2009). The following sections present an analytical method to determine these parameters efficiently.

Fig. 4. Gesture spotting network (GSN) for continuous gesture recognition.


4.3. Optimizing GSNs

Let N_G be the number of gestures in the lexicon, N_T the number of states in the threshold model and P_A the self-transition probability of the terminal state in each gesture model. Since P_G is a function (4) of P_T in this GSN,

$$P_T N_T + P_G N_G + P_A = 1 \qquad (4)$$

conditions (2) and (3) can be rewritten as (5) and (6), respectively, in terms of P_G alone (represented as x henceforth for simplicity of notation):

$$x > \frac{k_1\,P(O_G|\lambda_T)}{k_2\,P(O_G|\lambda_T) + P(O_G|\lambda_G)} = A \qquad (5)$$

$$x < \frac{k_1\,P(O_T|\lambda_T)}{k_2\,P(O_T|\lambda_T) + P(O_T|\lambda_G)} = B \qquad (6)$$

where

$$k_1 = \frac{1 - P_A}{N_T}, \qquad k_2 = \frac{N_G}{N_T} \qquad (7)$$

Therefore, given m gesture samples and n non-gesture samples (obtained from manually annotated sequences), determining the optimal P_G, P_T can be cast as the following optimization problem:

$$\begin{aligned} \text{maximize} \quad & \frac{1}{m}\sum_{i=1}^{m} I(x > A_i) + \frac{1}{n}\sum_{i=1}^{n} I(x < B_i) \\ \text{s.t.} \quad & 0 \le x \le U, \quad x \in \mathbb{R} \end{aligned} \qquad (8)$$

where U = k_1/k_2 is an upper bound on x from (4) and I(u) is an indicator function defined as

$$I(u) = \begin{cases} 1, & u \ge 0 \\ 0, & u < 0 \end{cases} \qquad (9)$$

Problem (8) can be formulated as a binary mixed integer program (MIP) (10) by modeling the indicator function (9) with binary variables:

$$\begin{aligned} \text{maximize} \quad & \frac{1}{m}\sum_{i=1}^{m} y^A_i + \frac{1}{n}\sum_{i=1}^{n} y^B_i \\ \text{s.t.} \quad & x \ge A_i\,y^A_i \\ & x \le B_i\,y^B_i + (1 - y^B_i)\,U \\ & x \in \mathbb{R},\ \ y^A \in \{0,1\}^m,\ \ y^B \in \{0,1\}^n \end{aligned} \qquad (10)$$

The dataset used in this paper has m = 4700 and n = 2818 (i.e. m gestural sequences and n non-gestural sequences). Large values of m, n result in a computationally expensive binary MIP (10). For example, the YALMIP (Lofberg, 2004) integer branch and bound solver returned a sub-optimal solution after 35 s of computation on a 3 GHz Intel Xeon processor.

This paper presents the following theorem (with proof), which shows that an optimal x can be found in O((m + n) log mn) time.

Theorem 4.1. A finite, optimal solution x* (if one exists) to the following problem, subject to x ∈ ℝ, can be found in O((m + n) log mn) time:

$$\underset{0 \le x \le U}{\text{maximize}} \ \ \frac{1}{m}\sum_{i=1}^{m} I(x > A_i) + \frac{1}{n}\sum_{i=1}^{n} I(x < B_i) \qquad (11)$$

Proof. This immediately follows from Lemmas 4.2, 4.3 and 4.4. □

Lemma 4.2. Suppose x* ∉ D = A ∪ B is a finite, optimal solution to problem (11).


a. Then x* ∈ (a, b) where a ∈ A, b ∈ B.

Proof. Since x* ∈ ℝ and A, B are finite, non-empty, discrete sets, there exists an interval I = (d_1, d_2) such that x* ∈ I, where d_1, d_2 ∈ D. Since D = A ∪ B, I can be either I_1 = (a_0, b_0), I_2 = (a_1, a_2), or I_3 = (b_1, b_2), where a_i ∈ A, b_i ∈ B.

Suppose x* ∈ I_2. Then let x_a = a_2 + ε for ε > 0. Also, let f(x) be the objective function of problem (11). Then there exists an ε such that f(x_a) > f(x*), which is a contradiction since x* is optimal. A similar contradiction arises if x* ∈ I_3, since the objective function can be improved by choosing x_b = b_1 − ε for ε > 0. Hence, x* ∈ (a_0, b_0). □

b. Then there exists a finite, optimal solution x′ to the problem

$$\begin{aligned} \text{maximize} \quad & \frac{1}{m}\sum_{i=1}^{m} I(x \ge A_i) + \frac{1}{n}\sum_{i=1}^{n} I(x \le B_i) \\ \text{s.t.} \quad & x \in D \end{aligned} \qquad (12)$$

such that the objective value f(x′) = f(x*) − μ, where f(x) is the objective function of problem (11) and μ = min{1/m, 1/n}.

Proof. From Lemma 4.2a, x* ∈ (a, b) where a ∈ A, b ∈ B. Therefore, if we maximize the objective function of problem (12) over the finite discrete set D, x = a or x = b is optimum, depending on m and n.

If 1/m > 1/n, constraints from A are valued more than constraints from B. Therefore x′ = b (since x′ = b > a), which leads to the reduction of one satisfied constraint from B and thus f(x′) = f(x*) − 1/n. Similarly, if 1/m < 1/n, x′ = a and f(x′) = f(x*) − 1/m. □

Lemma 4.3. The optimal solution x′ to (12) can be computed in O((m + n) log mn) time.

Proof. A, B are first sorted in O(m log m + n log n) time. To simplify notation, the objective function f(x) in (12) is represented as

$$f(x) = \frac{1}{m} S_A(x) + \frac{1}{n} S_B(x) \qquad (13)$$

where $S_A(x) = \sum_{i=1}^{m} I(x \ge A_i)$ and $S_B(x) = \sum_{i=1}^{n} I(x \le B_i)$.

For a candidate solution x = a_i, S_A = i, and S_B can be computed in O(log n) time using binary search over the sorted B. Similarly, for a candidate solution x ∈ B, S_B can be computed in O(1) time and S_A in O(log m) time.

Therefore, the total time complexity of finding an optimal x′ is O((m + n) log mn). □

Lemma 4.4. Given an optimal solution x′ to (12), an optimal solution x* to problem (11) can be computed in O(m) time if 1/m > 1/n, and in O(n) time otherwise.

Proof. Since x* ∈ (a, b) (from Lemma 4.2) and x′ is an upper or lower bound on x*, x* can be computed by finding the other bound and picking a value in the interval. If x′ ∈ A, then the closest value b ∈ B can be found in O(n) time, and similarly, if x′ ∈ B, the corresponding a can be found in O(m) time. □

It should be noted that Lemma 4.2a reduces the GSN training problem to solving an instance of the weighted constraint satisfaction problem (WCSP) (Ansótegui et al., 2007), which in its general form is NP-hard. Due to the structure of this instance of the WCSP, the above algorithm computes an optimal solution for P_G in O((m + n) log mn) time.



Algorithm 1. Optimize GSN

Compute A, B from segmented gesture and non-gesture sequences using (5) and (6), and sort A, B, where D = A ∪ B
for i = 1 → m do
    x ← a_i,  S_A ← i,  S_B ← binary_search(x, B)
    S_i ← (1/m) S_A + (1/n) S_B
end
for i = 1 → n do
    x ← b_i,  S_A ← binary_search(x, A),  S_B ← n − i + 1
    S_{i+m} ← (1/m) S_A + (1/n) S_B
end
return P_G ← the element of D achieving max_{i ∈ {1, ..., m+n}} S_i
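The sketch below is a hedged Python rendering of Algorithm 1, i.e. of the discrete problem (12); following Lemma 4.4, the returned candidate can be shifted slightly into the adjacent open interval to satisfy the strict inequalities of (11), and P_T then follows from (4) as P_T = (1 − P_A − P_G·N_G)/N_T.

```python
import bisect

def optimize_gsn(A_scores, B_scores):
    """Maximize (1/m)*#{A_i <= x} + (1/n)*#{B_i >= x} over candidates x in A ∪ B.

    A_scores: lower-bound constants A_i from gesture sequences, computed via (5).
    B_scores: upper-bound constants B_i from non-gesture sequences, computed via (6).
    Runs in O((m + n) log mn) after sorting; returns the best candidate and its score.
    """
    A, B = sorted(A_scores), sorted(B_scores)
    m, n = len(A), len(B)
    best_x, best_score = None, -1.0

    for i, a in enumerate(A):                      # candidate x = a_i
        s_a = i + 1                                # A_j <= a_i (sorted prefix)
        s_b = n - bisect.bisect_left(B, a)         # B_j >= a_i (binary search)
        score = s_a / m + s_b / n
        if score > best_score:
            best_x, best_score = a, score

    for i, b in enumerate(B):                      # candidate x = b_i
        s_b = n - i                                # B_j >= b_i (sorted suffix)
        s_a = bisect.bisect_right(A, b)            # A_j <= b_i (binary search)
        score = s_a / m + s_b / n
        if score > best_score:
            best_x, best_score = b, score

    return best_x, best_score                      # best_x plays the role of P_G
```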

4.4. Further optimization

The scores A, B computed in (5) and (6) use the probabilities P(O|λ_G) and P(O|λ_T). These probabilities are computed with the Viterbi algorithm on the gesture (λ_G) and threshold (λ_T) models.

P(O|λ_G) is close to P′(O|λ_G) (where P′ is the probability computed from the sub-model within the network) since the self-transition probability of the terminal node is set to P_A ≈ 1. Usually, obtaining P′(O|λ_T) from P(O|λ_T) is difficult (Lee and Kim, 1999), and the two are assumed to be close enough. In this GSN, the transition probability between any two nodes in the threshold model is uniform (1/N_T), providing a simple relation between P(O|λ_T) and P′(O|λ_T). This relation allows for the computation of accurate values of A, B.

Since the state transition probability in λ_T is 1/N_T, the Viterbi score for a sequence O of length L is

$$P(O|\lambda_T) = \left(\frac{1}{N_T}\right)^{L} b \qquad (14)$$

where b is the product of the observation probabilities corresponding to the symbols in O. In the GSN, every state in the threshold model is connected to the S state as well. Therefore,

$$P'(O|\lambda_T) = \left(\frac{1}{N_T + 1}\right)^{L} b \qquad (15)$$

Using (16), we can accurately compute P′(O|λ_T) and thus A, B:

$$P'(O|\lambda_T) = \left(\frac{N_T}{N_T + 1}\right)^{L} P(O|\lambda_T) \qquad (16)$$
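In practice the correction (16) is conveniently applied in the log domain, as in the small sketch below (names are illustrative).

```python
import math

def corrected_log_threshold_score(log_p_threshold, seq_len, n_threshold_states):
    """Apply (16) in the log domain:
    log P'(O|lambda_T) = L * log(N_T / (N_T + 1)) + log P(O|lambda_T)."""
    return seq_len * math.log(n_threshold_states / (n_threshold_states + 1.0)) + log_p_threshold
```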

5. Integration of non-visual context

Non-visual contextual cues are used to determine the probability of observing the given gesture which was segmented and recognized by the GSN. The probability of observing the gesture w.r.t. the previous command, as well as the delay between commands, is thresholded to ensure that the gesture from the GSN is not a false positive.

5.1. Delay between commands

The delay before a command is issued (t_D) provides predictive information about the command. The time delay between navigational commands (up, down, left and right) is much shorter than the delay between manipulation commands (zoom in/out, increase/decrease brightness). The time between commands is assumed to be normally distributed and was fitted to a normal distribution N(μ_k, σ_k) for each gesture λ_k using data collected from observing user interactions.
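A sketch of this step with SciPy is given below; treating P(λ|t_D) as proportional to the fitted Gaussian density is an assumption about how the probability is evaluated.

```python
from scipy.stats import norm

delay_models = {}  # gesture -> (mu, sigma) of the fitted delay distribution

def fit_delay_model(gesture, observed_delays):
    """Fit N(mu_k, sigma_k) to the delays (in seconds) observed before this gesture."""
    mu, sigma = norm.fit(observed_delays)          # maximum-likelihood estimates
    delay_models[gesture] = (mu, sigma)

def delay_likelihood(gesture, t_d):
    """P(lambda | t_D), taken here as the density of the fitted normal at t_D."""
    mu, sigma = delay_models[gesture]
    return norm.pdf(t_d, loc=mu, scale=sigma)
```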

5.2. Command history

Due to the nature of the interface, the probability of observing a command C_t given the previous command C_{t−1} is not uniform. Therefore, a transition matrix A (where a_ij represents the probability of observing command i after command j) is learned from user interactions by modeling the sequence of commands as a first-order Markov chain.

5.3. Integration of non-visual contextual cues

Given the candidate gesture λ from the GSN, the corresponding command C_t, the time between commands t_D, and the previous command C_{t−1}, the gesture λ is accepted if the probability of observing the candidate gesture P_λ (given below) exceeds an empirically determined threshold ε:

$$P_\lambda = P(\lambda|t_D)\,P(\lambda|C_{t-1}) \qquad (17)$$
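Continuing the previous sketch, the command-history model and the acceptance test (17) could look as follows; the Laplace smoothing of the transition counts and the reuse of delay_likelihood above are illustrative choices.

```python
import numpy as np

def learn_command_transitions(sessions, commands):
    """First-order Markov model of the command history: a_ij = P(C_t = i | C_{t-1} = j)."""
    idx = {c: k for k, c in enumerate(commands)}
    counts = np.ones((len(commands), len(commands)))   # Laplace smoothing (assumption)
    for seq in sessions:
        for prev, cur in zip(seq, seq[1:]):
            counts[idx[cur], idx[prev]] += 1
    return counts / counts.sum(axis=0, keepdims=True), idx

def accept_gesture(gesture, t_d, prev_command, A, idx, epsilon):
    """Accept the GSN candidate only if P(lambda|t_D) * P(lambda|C_{t-1}) exceeds epsilon, as in (17)."""
    p = delay_likelihood(gesture, t_d) * A[idx[gesture], idx[prev_command]]
    return p > epsilon
```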

Additionally, the position of the tip of the biopsy needle in the patient's brain during the procedure (see Fig. 2) provides further contextual information. Infrared LEDs were mounted on a case which contained the circuitry to power them. The position of the LEDs was determined through simple blob analysis on the IR image obtained from the Kinect, using algorithms similar to the ones used by the Wii. Since the 3D information at (and around) the LEDs is unreliable (the Kinect uses structured IR light to compute 3D information), the 3D coordinates of points between the LEDs are computed instead. A 3D line is fit to these points and then extrapolated, given that the Euclidean distance between the center of the line and the tip of the needle is fixed. The tip of the needle was then calibrated with respect to the entry point on the mock model.

During the experiment, a participant inserted the needle into the head of the mock model (see Fig. 2). Modeling clay was inserted into the head cavity to provide tactile feedback. The depth of the needle in the head was computed using the extrapolated position of the tip of the needle. This contextual information was used to determine the appropriate MRI image sequence and slice.
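A minimal sketch of the line fit and extrapolation, assuming the intermediate 3D points between the LEDs have already been recovered from the depth image, might be:

```python
import numpy as np

def extrapolate_needle_tip(points_3d, tip_offset):
    """Fit a 3D line to the points sampled between the IR LEDs and extrapolate the tip.

    points_3d : (N, 3) array of 3D points along the needle case (where depth is reliable).
    tip_offset: fixed Euclidean distance from the centre of the fitted segment to the tip.
    """
    P = np.asarray(points_3d, float)
    centre = P.mean(axis=0)
    # Principal direction of the point cloud = direction of the needle shaft.
    _, _, vt = np.linalg.svd(P - centre)
    direction = vt[0] / np.linalg.norm(vt[0])
    # The sign of `direction` must be resolved from the known geometry (assumption).
    return centre + tip_offset * direction
```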

6. Experiments

6.1. Isolated gesture recognition

A dataset of 1000 gestures was collected from 10 subjects (10 gestures per user per class). 10-fold cross validation (see Fig. 5(a)) was used to determine the optimal value of an operating parameter (the quantization threshold τ). The value τ = 25 mm was found to be optimal (see Fig. 5(b)) with a mean recognition accuracy of 96.41%.

6.2. Continuous gesture recognition

A dataset of 900 gestures was collected from nine subjects. Each subject performed every gesture from the lexicon (of 10 gestures) 10 times. The continuous gesture recognition system was trained using the optimal models obtained from the isolated gesture recognition experiment above. The detection rate D and reliability R of the system (Lee and Kim, 1999) were computed for each gesture (see Table 1) using the number of insertion errors (false positives) I, the number of correctly detected gestures C and the number of performed gestures N, as in (18). The mean detection and reliability rates were 92.26% and 89.97%, respectively.

$$D = \frac{C}{N}, \qquad R = \frac{C}{N + I} \qquad (18)$$


Fig. 5. (a) Mean gesture recognition accuracy vs. quantization threshold τ; (b) confusion matrix for τ = 25 mm.

Table 1
Detection and reliability of continuous gesture recognition.

Gesture           Detection (%)   Reliability (%)
Clockwise         90.32           90.32
Counterclockwise  94.81           94.81
Left              85.92           83.07
Right             97.14           94.18
Zoom-in           79.81           73.97
Zoom-out          92.26           91.54
Up                96.97           94.28
Down              92.70           86.35
Brightness-down   96.97           96.97
Brightness-up     95.76           94.20
Mean              92.26           89.97

Fig. 6. ROC curve for intention recognition.


6.3. Intention recognition

A dataset of 125,846 poses was collected from 20 subjects. The data was collected from Kinect recordings of a usability study in which subjects interacted with the MRI viewer running an earlier version of the system (Jacob et al., 2012). 42.47% of the poses were unintentional (the subject did not intend to make a gesture) and the rest were intentional. A pruned decision tree was trained on the collected data to accurately recognize the intent of the subject. 10-fold cross validation was conducted to determine the optimal pruning level l for the decision tree (see Fig. 6). The optimal operational parameter l = 28 was found to yield a true positive rate of 99.55% and a false positive rate of 1.3%.


6.4. Usability study

A usability study with 19 subjects was conducted to compare different interaction paradigms. The male/female ratio was 13:6, with male subjects aged 19–29 years (mean ± SD 23.46 ± 3.02) and female subjects aged 18–28 years (mean ± SD 21.66 ± 4.32). In the first paradigm (henceforth referred to as "Context"), the subject used hand gestures with integrated context to browse, select, zoom and rotate radiological images. The second paradigm (henceforth referred to as "Assistant") involved an assistant who used the keyboard and mouse to search for images based on verbal commands from the subject. The interaction sequences and the time taken to complete the task of finding and manipulating the landmark image of interest were recorded. Each subject worked on three biopsy sites on the mock model and performed the aforementioned task a total of 12 times for each interaction paradigm. After the mock operations, the subject completed a questionnaire on a Likert scale (maximum of five) regarding their experience, which recorded their opinion of ease of use, naturalness and precision when using the Context paradigm compared to the Assistant paradigm.

The analysis of task completion times revealed a learning curve in the Context paradigm. The learning rates for each subject (across the aforementioned 12 trials) were computed, and a representative learning curve was selected by choosing the subject with the median learning rate of 74.62%. The corresponding learning rate of the representative participant for the Assistant paradigm was 94.02%. The learning curves were fit using outlier rejection based on Cook's distance, and it was observed that at least 10 of the 12 data points obtained from the trials were not rejected as outliers. Fig. 7 displays the learning curves for both paradigms. The low coefficient of determination (R² value) and lower learning rate for the learning curve of the Assistant paradigm indicate that little learning occurs; this is anticipated due to the assistant's proficiency with the keyboard and mouse.
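The paper does not spell out the curve model; assuming the standard power-law (Wright) learning-curve form T(n) = T_1·n^b with learning rate 2^b, the fit could be sketched as below (the Cook's-distance outlier rejection is omitted).

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, t1, b):
    """Power-law learning curve: completion time of the n-th trial."""
    return t1 * n ** b

def fit_learning_rate(times):
    """Fit trial completion times and return the learning rate 2**b.

    A learning rate of, e.g., 0.75 means each doubling of the trial count
    cuts the completion time to 75% of its previous value.
    """
    trials = np.arange(1, len(times) + 1)
    (t1, b), _ = curve_fit(power_law, trials, np.asarray(times, float), p0=(times[0], -0.3))
    return 2.0 ** b
```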

It is also observed that the learning curves intersect between the 5th and 6th trials, which indicates that a user can be expected to improve with the Context paradigm and surpass the Assistant paradigm after the 6th trial. ANOVA was conducted on the task completion times, and a significant (p < 0.05) improvement in task completion time was observed after the 9th trial (see Fig. 8).

The mean number of commands required to complete the task was compared for both paradigms. It was observed through ANOVA that the Context paradigm (mean ± SD 4.56 ± 2.83 commands) required significantly (p < 0.05) fewer commands than the Assistant paradigm (mean ± SD 19.89 ± 2.86 commands).


Fig. 7. Learning curves for the representative participant for the Context (fit R² = 0.75) and Assistant (fit R² = 0.51) paradigms.

Fig. 8. Mean task completion times with error bars for trials 10, 11 and 12 for the Context and Assistant paradigms.

Fig. 9. Mean questionnaire responses (ease of use, naturalness, precision) for the Context and Assistant paradigms.


ANOVA on the questionnaire data (see Fig. 9) indicated that the Context interaction paradigm was rated significantly (p < 0.05) more natural than the Assistant paradigm. ANOVA on the other questions (ease of use and precision) did not indicate any statistically significant differences.


7. Discussion and conclusion

It has been proven that exhaustive search for training the GSN is unnecessary, and an efficient method for training GSNs has been developed. The mean isolated gesture recognition accuracy of the sub-models used to build the GSN was found to be 96.41%. Experimental results show that the mean continuous gesture recognition detection rate and reliability are 92.26% and 89.97%, respectively.

Additionally, contextual cues have been successfully integrated to improve task completion performance. Intent recognition achieves a true positive rate of 99.55% with a false positive rate of 1.3%. It was shown that users learned to use the context-integrated interaction paradigm faster than the conventional assistant-based interaction paradigm; this is evinced by the significant reduction in task completion time over the last 25% of the trials when utilizing the context-based system. The context-based system was also shown to be significantly more natural than the assistant-based system upon analysis of the post-study questionnaire.

Future work includes testing the efficiency of the continuous gesture spotting network on other datasets and testing the system in a real-world scenario such as the operating room.

Acknowledgments

This project was supported by grant number R03HS019837 from the Agency for Healthcare Research and Quality (AHRQ). The content is solely the responsibility of the authors and does not necessarily represent the official views of the AHRQ. The AHRQ had no involvement in the study design; in the collection, analysis and interpretation of data; in the writing of the report; and in the decision to submit the article for publication. We would like to thank Rebecca Packer for all her support in this project.

References

WHO, 2009. WHO guidelines for safe surgery 2009: safe surgery saves lives. World Health Organization.

Albu, A., 2006. Vision-based user interfaces for health applications: a survey. Adv. Visual Comput., 771–782.

Schultz, M., Gill, J., Zubairi, S., Huber, R., Gordin, F., 2003. Bacterial contamination of computer keyboards in a teaching hospital. Infect. Control Hosp. Epidemiol. 24 (4), 302–303.

Oliveira, J.F., Capote, H., Pagador, J.B., Moyano, J.L., Margallo, F., 2012. Perspectives on computer assisted surgery (CAS) in minimally invasive surgery and the role of the CAS system nurse. Int. J. Bioinf. Biosci. 2 (4).

Maintz, J., Viergever, M.A., 1998. A survey of medical image registration. Med. Image Anal. 2 (1), 1–36.

Firth-Cozens, J., 2004. Why communication fails in the operating room. Qual. Safety Health Care 13 (5), 327.

Halverson, A.L., Casey, J.T., Andersson, J., Anderson, K., Park, C., Rademaker, A.W., Moorman, D., 2011. Communication failure in the operating room. Surgery 149 (3), 305–310.

Lingard, L., Espin, S., Whyte, S., Regehr, G., Baker, G.R., Reznick, R., Bohnen, J., Orser, B., Doran, D., Grober, E., 2004. Communication failures in the operating room: an observational classification of recurrent types and effects. Qual. Safety Health Care 13 (5), 330–334.

Boochever, S.S., 2004. HIS/RIS/PACS integration: getting to the gold standard. Radiol. Manage. 26 (3), 16–24, Quiz 25–27.

Lemke, H.U., Berliner, L., 2011. PACS for surgery and interventional radiology: features of a therapy imaging and model management system (TIMMS). Eur. J. Radiol. 78 (2), 239–242.

Mulvaney, J., 2002. The case for RIS/PACS integration. Radiol. Manage. 24 (3), 24–29.

Gallo, L., Placitelli, A.P., Ciampi, M., 2011. Controller-free exploration of medical image data: experiencing the Kinect. In: 24th International Symposium on Computer-Based Medical Systems (CBMS), 2011, pp. 1–6.

Ebert, L.C., Hatch, G., Ampanozi, G., Thali, M.J., Ross, S., 2011. You cannot touch this: touch-free navigation through radiological images. Surg. Innovation.

Johnson, R., O'Hara, K., Sellen, A., Cousins, C., Criminisi, A., 2011. Exploring the potential for touchless interaction in image-guided interventional radiology. In: Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, New York, NY, USA, pp. 3323–3332.

Mentis, H.M., O'Hara, K., Sellen, A., Trivedi, R., 2012. Interaction proxemics and image use in neurosurgery. In: Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems, New York, NY, USA, pp. 927–936.


Micire, M., Desai, M., Courtemanche, A., Tsui, K.M., Yanco, H.A., 2009. Analysis of natural gestures for controlling robot teams on multi-touch tabletop surfaces. In: Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces, New York, NY, USA, pp. 41–48.

Hauptmann, A.G., 1989. Speech and gestures for graphic image manipulation. ACM SIGCHI Bull. 20, 241–245.

Gratzel, C., Fong, T., Grange, S., Baur, C., 2004. A non-contact mouse for surgeon–computer interaction. Technol. Health Care Eur. Soc. Eng. Med. 12 (3), 245–258.

Keskin, C., Balci, K., Aran, O., Sankur, B., Akarun, L., 2007. A multimodal 3D healthcare communication system. In: 3DTV Conference, 2007, pp. 1–4.

Mettler, L., Ibrahim, M., Jonat, W., 1998. One year of experience working with the aid of a robotic assistant (the voice-controlled optic holder AESOP) in gynaecological endoscopic surgery. Hum. Reprod. 13 (10), 2748–2750.

Nishikawa, A., Hosoi, T., Koara, K., Negoro, D., Hikita, A., Asano, S., Kakutani, H., Miyazaki, F., Sekimoto, M., Yasui, M., et al., 2003. FAce MOUSe: a novel human–machine interface for controlling the position of a laparoscope. IEEE Trans. Rob. Autom. 19 (5), 825–841.

Jacob, M.G., Li, Yu-Ting, Wachs, J.P., 2011. A gesture driven robotic scrub nurse. In: 2011 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 2039–2044.

Jacob, M.G., Li, Y.-T., Wachs, J.P., 2012. Gestonurse: a multimodal robotic scrub nurse. In: Proceedings of the Seventh Annual ACM/IEEE International Conference on Human–Robot Interaction, New York, NY, USA, pp. 153–154.

Zeng, J., Wang, Y., Freedman, M., Mun, S.K., 1997. Finger tracking for breast palpation quantification using color image features. Opt. Eng. 36, 3455.

Grange, S., Fong, T., Baur, C., 2004. M/ORIS: a medical/operating room interaction system. In: Proceedings of the 6th International Conference on Multimodal Interfaces, pp. 159–166.

Yanagihara, Y., Hama, H., 2000. System for selecting and generating images controlled by eye movements applicable to CT image display. Med. Imaging Technol. 18 (5), 725–734.

Wachs, J.P., Stern, H.I., Edan, Y., Gillam, M., Handler, J., Feied, C., Smith, M., 2008. A gesture-based tool for sterile browsing of radiology images. J. Am. Med. Inf. Assoc. 15 (3), 321–323.

Jacob, M., Cange, C., Packer, R., Wachs, J.P., 2012. Intention, context and gesture recognition for sterile MRI navigation in the operating room. In: Alvarez, L., Mejail, M., Gomez, L., Jacobo, J. (Eds.), Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer Berlin Heidelberg, pp. 220–227.

Shreiner, D., Woo, M., Neider, J., Davis, T., 2005. OpenGL(R) Programming Guide: The Official Guide to Learning OpenGL(R), fifth ed. Addison-Wesley Professional, OpenGL, Version 2.

Bradski, G., 2000. The OpenCV Library. Dr. Dobb's Journal of Software Tools.


Rosset, A., Spadola, L., Ratib, O., 2004. OsiriX: an open-source software for navigating in multidimensional DICOM images. J. Digit. Imaging 17 (3), 205–216.

Kinect – Xbox.com. [Online]. Available: <http://www.xbox.com/en-US/kinect>. [Accessed: 19-Jan-2012].

OpenNI, 2012. OpenNI User Guide.

Jacob, M., Cange, C., Packer, R., Wachs, J., 2012. Intention, context and gesture recognition for sterile MRI navigation in the operating room. In: Alvarez, L., Mejail, M., Gomez, L., Jacobo, J. (Eds.), Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, vol. 7441. Springer Berlin/Heidelberg, pp. 220–227.

Emery, N., 2000. The eyes have it: the neuroethology, function and evolution of social gaze. Neurosci. Biobehav. Rev. 24 (6), 581–604.

Langton, S.R.H., 2000. The mutual influence of gaze and head orientation in the analysis of social attention direction. Q. J. Exp. Psychol.: Sect. A 53 (3), 825–845.

Viola, P., Jones, M.J., 2004. Robust real-time face detection. Int. J. Comput. Vision 57 (2), 137–154.

Lee, H.-K., Kim, J.H., 1999. An HMM-based threshold model approach for gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 21 (10), 961–973.

Elmezain, M., Al-Hamadi, A., Michaelis, B., 2009. Hand trajectory-based gesture spotting and recognition using HMM. In: 2009 16th IEEE International Conference on Image Processing (ICIP), pp. 3577–3580.

Bernardin, K., Ogawara, K., Ikeuchi, K., Dillmann, R., 2005. A sensor fusion approach for recognizing continuous human grasping sequences using hidden Markov models. IEEE Trans. Rob. 21 (1), 47–57.

Junker, H., Amft, O., Lukowicz, P., Tröster, G., 2008. Gesture spotting with body-worn inertial sensors to detect user activities. Pattern Recogn. 41 (6), 2010–2024.

Malgireddy, M.R., Corso, J.J., Setlur, S., Govindaraju, V., Mandalapu, D., 2010. A framework for hand gesture recognition and spotting using sub-gesture modeling. In: 2010 20th International Conference on Pattern Recognition (ICPR), pp. 3780–3783.

Malgireddy, M.R., Inwogu, I., Govindaraju, V., 2012. A temporal Bayesian model for classifying, detecting and localizing activities in video sequences. In: 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), 16–21 June 2012, p. 6.

Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77 (2), 257–286.

Kelly, D., McDonald, J., Markham, C., 2009. Continuous recognition of motion based gestures in sign language. In: 2009 IEEE 12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 1073–1080.

Lofberg, J., 2004. YALMIP: a toolbox for modeling and optimization in MATLAB. In: 2004 IEEE International Symposium on Computer Aided Control Systems Design, pp. 284–289.

Ansótegui, C., Bonet, M.L., Levy, J., Manyà, F., 2007. The Logic behind Weighted CSP.
