
MobiGesture: Mobility-Aware Hand Gesture Recognition for Healthcare

Hongyang Zhao, Yongsen Ma, Shuangquan Wang, Amanda Watson, Gang Zhou
Computer Science Department, College of William and Mary

Email: {hyzhao, yma}@cs.wm.edu, {swang10, aawatson}@email.wm.edu, [email protected]

Abstract—Accurate recognition of hand gestures while moving is still a significant challenge, which prevents the wide use of existing gesture recognition technology. In this paper, we propose a novel mobility-aware hand gesture segmentation algorithm to detect and segment hand gestures. We also propose a Convolutional Neural Network (CNN) to classify hand gestures with mobility noises. Based on the segmentation and classification algorithms, we develop MobiGesture, a mobility-aware hand gesture recognition system for healthcare. For the leave-one-subject-out cross-validation test, experiments with human subjects show that the proposed segmentation algorithm achieves 94.0% precision and 91.2% recall when the user is moving. The proposed hand gesture classification algorithm is 16.1%, 15.3%, and 14.4% more accurate than state-of-the-art work when the user is standing, walking, and jogging, respectively.

Index Terms—Mobility-Aware, Gesture Recognition, Gesture Segmentation, Convolutional Neural Network

I. INTRODUCTION

Regular mobility, such as walking or jogging, is one of the most effective ways to promote health and well-being. It helps improve overall health and reduces the risk of many health problems, such as diabetes, cardiovascular disease, and osteoarthritis [1] [2] [3]. In addition, regular mobility can also improve depression, cognitive function, vision problems, and lower-body function [4] [5]. According to the evidence-based Physical Activity Guidelines released by the World Health Organization [6] and the U.S. government [7], adults aged 18-64 should do at least 150 minutes of moderate-intensity or 75 minutes of vigorous-intensity aerobic activity per week, or an equivalent combination of both, to stay healthy.

Nowadays, many people like to listen to music on their smartphones while doing aerobic activity such as walking or jogging. Many smartphone apps, such as Nike+ Run Club [8], RunKeeper [9], and MapMyRun [10], track users' workouts, play music, and even match the tempo of the songs to users' paces. However, it is inconvenient for users to interact with these apps while walking or jogging: to change the music, a user needs to slow down, take out the smartphone, and then change the music. Instead, it is more convenient for users to use gestures to control the music. Unlike traditional touchscreen-based interaction, hand gestures can simplify the interaction with a smartphone by reducing the need to take out the smartphone and slow down the pace.

When a user is moving, gesture recognition is difficult. The first reason is that hand swinging motions during walking or jogging are mixed with the hand gestures, so it is hard to tell whether a hand movement comes from a hand swinging motion or a hand gesture. In addition, when the user performs a hand gesture while moving, the hand movement is a combination of the hand gesture and the body movement. The mobility noise caused by the body movement reduces the accuracy of hand gesture recognition. Therefore, it is hard to recognize hand gestures when the user is moving. To solve the gesture recognition problem when the user is walking or jogging, two research questions need to be answered: (1) How to segment the hand gestures when the user is moving? (2) How to accurately classify the hand gestures with mobility noises?

In order to answer the first research question, we first apply an AdaBoost classifier to classify the current body movement into moving or non-moving. If the user is not moving, we apply a threshold-based segmentation algorithm to segment the hand gestures. If the user is moving, the sensor readings are periodic and self-correlated, so we propose a novel self-correlation metric to evaluate the self-correlation of the sensor readings. If the sensor readings are not self-correlated at the moving frequency, we regard that interval as a potential gesture sample. Then, a moving segmentation algorithm is applied to segment the hand gestures.

In order to answer the second research question, we design a CNN model to classify the hand gestures with mobility noises. We apply a batch normalization layer, a dropout layer, a max-pooling layer, and L2 regularization to overcome overfitting and handle mobility noises.

In addition, we integrate the gesture segmentation and classification algorithms into a system called MobiGesture. For the leave-one-subject-out cross-validation test, experiments with human subjects show that the proposed segmentation algorithm accurately segments the hand gestures with 94.0% precision and 91.2% recall when a user is moving. The proposed hand gesture classification algorithm is 16.1%, 15.3%, and 14.4% more accurate than state-of-the-art work when a user is standing, walking, and jogging, respectively.

As far as we know, two efforts have been made to study the gesture recognition problem when a user is moving. Park et al. [11] propose a Multi-situation HMM architecture. They train an HMM model for each pair of hand gesture and mobility situation. As the authors define 8 hand gestures and 4 mobility situations, 32 HMM models are trained in total. Given a hand gesture, they apply the Viterbi algorithm [12] to calculate the likelihood of each HMM model. The HMM model with the highest likelihood is selected as the classified gesture. As the number of hand gestures and/or the number of mobility situations increases, their computational cost increases dramatically. Different from their work, we only train one CNN model, which consumes much less computational power and time. In addition, evaluation results show that our CNN model performs better than Multi-situation HMM on gesture classification under the leave-one-subject-out cross-validation test. The second effort comes from Murao et al. [13]. They propose a combined-activity recognition system. This system first classifies user movement into one of three categories: postures (e.g., sitting), behaviors (e.g., walking), and gestures (e.g., a punch). Then, Dynamic Time Warping (DTW) is applied to recognize hand gestures for the specific category. However, their system requires five sensors to be attached to the human body for gesture recognition. Instead, we only use one sensor, and hence are less intrusive.

We summarize our contributions as follows:
1) We propose a novel mobility-aware gesture segmentation algorithm to detect and segment hand gestures.
2) We design a CNN model to classify hand gestures. This CNN model handles mobility noises and avoids overfitting.
3) We integrate the gesture segmentation and classification algorithms into a system, MobiGesture. Our experimental results show that the proposed segmentation algorithm achieves 94.0% precision and 91.2% recall when the user is moving. The proposed hand gesture classification algorithm is 16.1%, 15.3%, and 14.4% more accurate than state-of-the-art work when the user is standing, walking, and jogging, respectively.

The remainder of this paper is organized as follows. First, we present the motivation in Section II. Then, we introduce the system architecture in Section III. We present our mobility-aware segmentation algorithm in Section IV and our CNN model in Section V. In Section VI, we evaluate the system performance. We summarize the related work in Section VII. Finally, we draw our conclusion in Section VIII.

II. MOTIVATION

Hand gestures can help users interact with various mobile applications on smartphones in mobile situations. One common scenario is to control a music app while walking or jogging. We define hand gestures that are suitable for music control in Section II-A. Based on these defined gestures, we introduce the challenge of gesture recognition when the user is walking or jogging in Section II-B. Finally, we present our data collection and the resulting dataset in Section II-C. This dataset is used for performance evaluation throughout the rest of the paper.

A. Gesture Definition

There has been substantial research on gesture recognition. Some works define gestures according to application scenarios, such as gestures in daily life [14] or repetitive motions in very specific activities [15], while others define gestures casually [11]. In this paper, we carefully define hand gestures that are suitable for controlling a music app. Typically, a music app provides the following functions to control the music: next track, previous track, volume up, volume down, play/pause, repeat on/off, and shuffle on/off. Therefore, we define seven gestures corresponding to these seven functions. At the beginning, the user extends his/her hand in front of his/her body. Then he/she moves towards a certain direction and moves back to the starting point again.

Fig. 1. 17 defined gestures for remote control: Left, Right, Up, Down, Back & Forth, Clockwise, Counterclockwise, and the digit gestures 0-9.

Fig. 2. Accelerometer readings of a Right gesture when a user is jogging (jogging, raise hand, Right gesture, put down hand).

These seven gestures are illustrated in Fig. 1 and are defined as follows: (1) Left gesture: move left and then move back to the starting point; (2) Right gesture: move right and then move back to the starting point; (3) Up gesture: move up and then move back to the starting point; (4) Down gesture: move down and then move back to the starting point; (5) Back&Forth gesture: move to the shoulder and then extend again to the starting point; (6) Clockwise gesture: draw a clockwise circle; (7) Counterclockwise gesture: draw a counterclockwise circle. In addition, we define 10 number gestures, as shown in Fig. 1, to select a specific song in the music app.

B. Challenge of Gesture Recognition under Mobility

To recognize hand gestures, a typical gesture processing pipeline consists of three steps: (1) detect a hand gesture from a sequence of hand movements; (2) segment the hand gesture; (3) classify the segmented hand gesture. When it comes to a mobile situation, the noises caused by body movements present several practical challenges for these three steps.

First, it is hard to detect gestures while a user is moving. While the user is standing without performing any gesture, the accelerometer readings stay stable. When the user performs a gesture such as the Right gesture, the accelerometer readings change dramatically. Therefore, it is easy to detect a gesture by measuring the amplitude or deviation of the sensor readings. However, when it comes to a jogging scenario, a Right gesture and hand swinging motions are mixed together, as shown in Fig. 2. Therefore, it is hard to tell whether a hand movement comes from the hand swinging motions or a hand gesture.

The second challenge is that it is hard to segment hand gestures while the user is moving. To perform a hand gesture while walking or jogging, the user needs to raise his/her hand, perform a gesture, and then put down his/her hand. To segment hand gestures while the user is moving, we need to not only filter out hand swinging motions caused by body movements, but also accurately exclude the hand-raising and hand-lowering movements. If the starting point and end point of a hand gesture are not precisely determined, it is hard to classify hand gestures accurately. As shown in Fig. 2, it is difficult to find the starting point and end point of a Right gesture while jogging.

The third challenge is that it is hard to classify hand gestures when a user is moving. After gesture segmentation, a segmented hand gesture sample includes not only the gesture the user performs, but also the noises caused by the body movements. Additionally, when the user performs a hand gesture while walking or jogging, (s)he needs to keep the walking/jogging pace while performing this gesture. The effort to keep the moving pace influences the shape of the hand gesture that the user performs. Therefore, the hand gesture performed when the user is standing is slightly different from the same type of hand gesture performed when the user is walking or jogging. Both the mobility noises and the gesture differences reduce the accuracy of gesture classification.

C. Dataset

We used a UG wristband [16] to collect 17 hand gestures from 5 human subjects, which is shown in Fig. 3. The UG wristband sampled the accelerometer and gyroscope readings at 50 Hz. The data collection experiment contained three independent steps. (1) Each participant performed each gesture 10 times while standing. (2) Each participant performed each gesture 10 times while walking on a treadmill. (3) Each participant performed each gesture 10 times while jogging on a treadmill. In total, 2550 hand gestures were collected. While walking or jogging on a treadmill, different participants tended to walk or jog at different speeds. In our experiment, the speed of walking ranged from 2 miles/hour to 3 miles/hour, and the speed of jogging ranged from 4 miles/hour to 6 miles/hour. We took video of each participant as they completed these tasks to serve as ground truth. The characteristics of our participants are shown in Table I.

TABLE I
CHARACTERISTICS OF FIVE PARTICIPANTS

Human Subject No. | Gender | Age | Height (cm) | Weight (kg)
1                 | male   | 29  | 174         | 62
2                 | female | 27  | 167         | 55
3                 | male   | 28  | 180         | 73
4                 | male   | 39  | 170         | 87
5                 | male   | 30  | 171         | 68

III. SYSTEM ARCHITECTURE

The system architecture of MobiGesture is shown in Fig. 4. We apply a novel mobility-aware segmentation module to partition the raw accelerometer and gyroscope readings into segments so that each segment contains one complete hand gesture. In the mobility-aware segmentation module, we first detect whether or not the user is moving. We extract a series of time-domain and frequency-domain features from accelerometer readings and apply an AdaBoost classifier to classify the current body movement into moving or non-moving. If the user is not moving, sensor readings are clean and do not contain any mobility noise. In this case, we apply a simple threshold-based segmentation algorithm to segment the hand gestures.

On the other hand, if the user is walking or jogging, the sensor readings are periodic and self-correlated. We perform a Fast Fourier Transform (FFT) on the accelerometer readings and compute the dominant frequency, which is the frequency of walking or jogging. Based on the dominant frequency, we propose a novel self-correlation metric, SC. This metric represents the self-correlation characteristics of the accelerometer readings at the given frequency. When the user is walking or jogging, the sensor readings are self-correlated at the dominant frequency. Once the sensor readings are no longer self-correlated at the dominant frequency, we regard it as a potential gesture sample. Then, a moving segmentation algorithm is applied to partition the accelerometer and gyroscope readings into segments based on the SC metric.

As the 17 predefined gestures are different from each other and users tend to perform the gestures at different speeds, the duration of each gesture is different. Therefore, the size of each segment is different. We apply a Cubic Spline Interpolation algorithm [17] to rescale the size of each segment so that each segment contains the same number of data points. Finally, we design a 9-layer Convolutional Neural Network to recognize hand gestures. The Convolutional Neural Network is designed to resist overfitting and to be robust to mobility noises.
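To make the data flow of Fig. 4 concrete, the sketch below strings the stages together in Python. It is illustrative only: the helper names (classify_mobility, segment_non_moving, segment_moving, rescale_segment, cnn_model) are hypothetical stand-ins for the components described in Sections IV and V, not the authors' implementation.

```python
import numpy as np

def recognize_gestures(acc, gyro, classify_mobility, segment_non_moving,
                       segment_moving, rescale_segment, cnn_model):
    """Hypothetical end-to-end pipeline mirroring Fig. 4.

    acc, gyro: (N, 3) arrays of 50 Hz accelerometer / gyroscope readings.
    The five callables stand in for the modules described in the paper.
    """
    # 1. Mobility-aware segmentation: choose the segmenter from the mobility class.
    if classify_mobility(acc) == "moving":
        segments = segment_moving(acc, gyro)    # SC-based peak/valley search (Sec. IV-E)
    else:
        segments = segment_non_moving(gyro)     # gyroscope-amplitude threshold (Sec. IV-C)
    # 2. Data scaling and 3. CNN classification (Sec. V).
    gestures = []
    for start, end in segments:
        window = np.hstack([acc[start:end], gyro[start:end]])   # shape (length, 6)
        scaled = rescale_segment(window, n_points=60)           # cubic spline to 60 x 6
        probs = cnn_model.predict(scaled[np.newaxis, ..., np.newaxis])
        gestures.append(int(probs.argmax()))
    return gestures
```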

IV. MOBILITY-AWARE SEGMENTATION

A simple way to segment hand gestures from a sequence of hand movements is to use a hand-controlled button to clearly indicate the starting point and the end point of each individual gesture. However, in order to do so, the user must wear an external button on their fingers or hold it in their hands, which is obtrusive and burdensome. Another way is to segment gestures automatically: the motion data are automatically partitioned into non-overlapping, meaningful segments, such that each segment contains one complete gesture. Automatic segmentation when a user is moving faces a few challenges. First, when the user is moving, the hand gestures are mixed with the mobility noises, which leads to inaccurate segmentation. In addition, the segmentation should extract the hand movement caused by the hand gestures rather than the hand movement caused by the body movement. Otherwise, the extracted segments contain non-gesture noises or miss useful gesture information, which leads to inaccurate classification. To deal with these challenges, we propose a mobility-aware segmentation algorithm. We first classify the body movement into non-moving or moving. Then, we propose a non-moving segmentation algorithm and a moving segmentation algorithm to segment hand gestures for the two different moving scenarios.

Fig. 3. UG wristband: (a) placement of the UG wristband; (b) coordinate system of the UG wristband.

Fig. 4. System architecture: raw accelerometer and gyroscope readings enter the mobility-aware segmentation module (feature extraction, mobility classification, non-moving segmentation, self-correlation analysis, and moving segmentation); the segmented accelerometer/gyroscope data is then scaled and classified by the Convolutional Neural Network in the deep learning classification module.

A. Feature Extraction

It is difficult to accurately detect the mobility situation solely based on a wristband. The reason is that the sensors in the wristband measure the combination of hand motion, gravity, and body movement. In order to accurately detect whether the user is moving or not, additional sensors tightly attached to the body would be required. However, this requirement is intrusive.

Instead of attaching additional sensors to the body, we infer the body movement from the sensor readings of the wristband. When the user is walking, the hands point to the ground with the palm facing towards the user. When the user is jogging, the hands point forward with the palm facing towards the user. The orientation of the hand is fixed and stable during walking or jogging. The sporadic occurrence of a hand gesture influences the orientation of the hand only for a short time; the orientation of the hand is stable for most of the time during walking or jogging. This motivates us to use the orientation of the hand to infer the body movement. In addition, when a user is walking or jogging, the user swings his/her hands periodically. The sporadic occurrence of a hand gesture does not influence the dominant frequency of walking or jogging. This motivates us to use the frequency of hand swinging motions to infer the body movement.

We apply a sliding window with an overlap of 50% to the accelerometer readings. The window size is 5 seconds. We compute a series of time-domain and frequency-domain features for each time window.

For the time-domain features, we compute the mean of the accelerometer readings of the X-axis, Y-axis, and Z-axis, the mean of the pitch, and the mean of the roll to represent the orientation of the hand for each time window. The pitch and roll are computed as

\mathrm{Pitch} = \arctan\left(\frac{Acc_y}{\sqrt{(Acc_x)^2 + (Acc_z)^2}}\right), \quad (1)

\mathrm{Roll} = -\arctan\left(\frac{Acc_x}{Acc_z}\right), \quad (2)

where Acc_x, Acc_y, Acc_z are the accelerometer readings of the X-axis, Y-axis, and Z-axis for each time window.

For the frequency domain, we first compute the amplitude of the accelerometer readings as

Acc = \sqrt{(Acc_x)^2 + (Acc_y)^2 + (Acc_z)^2}, \quad (3)

where Acc_x, Acc_y, Acc_z are the accelerometer readings of the X-axis, Y-axis, and Z-axis for each time window. Then, we perform a Fast Fourier Transform (FFT) on the amplitude of the accelerometer readings within each time window. We find the dominant frequency, which has the largest amplitude in the frequency domain. Finally, the dominant frequency and the amplitude of the dominant frequency are chosen as the frequency-domain features.
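As a concrete illustration of the feature computation above, the snippet below derives the five time-domain features (per-axis means, mean pitch, mean roll) and the two frequency-domain features (dominant frequency and its amplitude) for one 5-second window of 50 Hz accelerometer data. It is a minimal sketch of Eqs. (1)-(3), not the authors' code; removing the DC component before the FFT so that the walking/jogging frequency dominates is an added assumption.

```python
import numpy as np

FS = 50          # sampling rate (Hz)
WIN = 5 * FS     # 5-second window; consecutive windows overlap by 50%

def window_features(acc):
    """acc: (WIN, 3) accelerometer window -> 7-dimensional feature vector."""
    ax, ay, az = acc[:, 0], acc[:, 1], acc[:, 2]
    # Time-domain features: per-axis means plus mean pitch and roll, Eqs. (1)-(2).
    pitch = np.arctan2(ay, np.sqrt(ax ** 2 + az ** 2))
    roll = -np.arctan2(ax, az)
    time_feats = [ax.mean(), ay.mean(), az.mean(), pitch.mean(), roll.mean()]
    # Frequency-domain features: dominant frequency of the amplitude signal, Eq. (3).
    amp = np.sqrt(ax ** 2 + ay ** 2 + az ** 2)
    spectrum = np.abs(np.fft.rfft(amp - amp.mean()))   # drop DC (assumption)
    freqs = np.fft.rfftfreq(len(amp), d=1.0 / FS)
    k = int(spectrum[1:].argmax()) + 1                 # skip the zero-frequency bin
    return np.array(time_feats + [freqs[k], spectrum[k]])
```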


TABLE II
COMPARISON OF MACHINE LEARNING ALGORITHMS FOR MOBILITY CLASSIFICATION

Test   | Algorithm    | Precision | Recall | F-Measure
5-fold | AdaBoost     | 93.5%     | 93.5%  | 93.5%
5-fold | Naive Bayes  | 91.6%     | 91.4%  | 91.5%
5-fold | SVM          | 93.5%     | 93.5%  | 93.3%
5-fold | J48          | 96.0%     | 96.0%  | 96.0%
5-fold | RandomForest | 96.9%     | 96.9%  | 96.9%
LOSO   | AdaBoost     | 94.9%     | 94.6%  | 94.4%
LOSO   | Naive Bayes  | 93.6%     | 91.4%  | 91.5%
LOSO   | SVM          | 93.6%     | 92.5%  | 92.2%
LOSO   | J48          | 91.0%     | 88.7%  | 89.0%
LOSO   | RandomForest | 93.7%     | 92.2%  | 92.2%

B. Mobility Classification

We apply the WEKA machine-learning suite [18] to train five commonly used classifiers: AdaBoost (run for 100 iterations), Naive Bayes, SVM (with polynomial kernels), J48 (equivalent to C4.5 [19]), and Random Forests (100 trees, 4 random features each). To evaluate the performance of the proposed algorithms, we apply two tests: 5-fold cross-validation and leave-one-subject-out (LOSO) cross-validation. The 5-fold cross-validation test uses all the gesture data to form the dataset. It partitions the dataset into 5 randomly chosen subsets of equal size. Four subsets are used to train the model, and the remaining one is used to validate the model. This process is repeated 5 times such that each subset is used exactly once for validation. The leave-one-subject-out cross-validation test uses the gesture data from four subjects to train the classification model and then applies this model to test the gesture samples from the remaining subject. Precision, recall, and F-measure are used as the evaluation metrics.

The classification results for these five algorithms are shown in Table II. Under the 5-fold cross-validation test, RandomForest performs the best: its precision, recall, and F-measure are all 96.9%. Under the leave-one-subject-out cross-validation test, AdaBoost performs the best, with 94.9% precision, 94.6% recall, and a 94.4% F-measure. We favor the leave-one-subject-out cross-validation test over the 5-fold cross-validation test to avoid overfitting. Therefore, we choose the AdaBoost classifier to classify the body movement.
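The classifiers above are trained in WEKA; for readers who prefer Python, an equivalent leave-one-subject-out evaluation of the AdaBoost mobility classifier could look like the following sketch (scikit-learn stands in for WEKA, and X, y, and subject_ids are assumed to hold the window features, moving/non-moving labels, and per-window subject identifiers).

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import LeaveOneGroupOut, cross_val_predict

def evaluate_mobility_classifier(X, y, subject_ids):
    """Leave-one-subject-out evaluation of the moving/non-moving classifier.

    X: (n_windows, n_features) feature matrix; y: 0/1 mobility labels;
    subject_ids: array mapping each window to the human subject it came from.
    """
    clf = AdaBoostClassifier(n_estimators=100)   # 100 boosting iterations, as above
    y_pred = cross_val_predict(clf, X, y, cv=LeaveOneGroupOut(), groups=subject_ids)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y, y_pred, average="weighted")
    return precision, recall, f1
```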

C. Non-Moving Segmentation

When the user is not moving, we apply a lightweight threshold-based detection method to identify the starting and end points of the hand gestures. To characterize a user's hand movement (HM), a detection metric is defined using the gyroscope sensor readings as

HM = \sqrt{Gyro_x^2 + Gyro_y^2 + Gyro_z^2}, \quad (4)

where Gyro_x, Gyro_y, Gyro_z are the gyroscope readings of the X-axis, Y-axis, and Z-axis. When the user's hand is stationary, the HM is very close to zero. The faster the hand moves, the larger the HM is. When the HM is larger than a threshold, i.e., 50 degrees/second, we regard it as the starting point of a hand movement. Once the HM is smaller than this threshold for a certain period of time, i.e., 200 ms, we regard it as the end point of the hand movement. The time threshold is necessary: without it, the HM may occasionally fall below the threshold during a gesture, leading to unexpected splitting of the gesture [20] [21]. As no gesture in our dataset lasts shorter than 260 ms or longer than 2.7 seconds, we drop a segment if its length is shorter than 260 ms or longer than 2.7 seconds.
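A minimal sketch of this threshold rule, assuming 50 Hz gyroscope data in degrees/second. The constants mirror the ones above (50 degrees/second start threshold, 200 ms hold time, 260 ms to 2.7 s valid length); the loop structure itself is an illustrative assumption, not the authors' code.

```python
import numpy as np

FS = 50  # sampling rate (Hz)

def segment_non_moving(gyro, start_thr=50.0, hold_ms=200, min_ms=260, max_ms=2700):
    """Threshold-based segmentation on the hand-movement metric HM of Eq. (4).

    gyro: (N, 3) gyroscope readings in degrees/second.
    Returns a list of (start, end) sample indices, one pair per candidate gesture.
    """
    hm = np.sqrt((gyro ** 2).sum(axis=1))          # HM = sqrt(Gx^2 + Gy^2 + Gz^2)
    hold = int(hold_ms * FS / 1000)
    segments, start, below = [], None, 0
    for i, value in enumerate(hm):
        if start is None:
            if value > start_thr:                  # HM rises above 50 deg/s: start point
                start, below = i, 0
        else:
            below = below + 1 if value < start_thr else 0
            if below >= hold:                      # below threshold for 200 ms: end point
                end = i - hold + 1
                length_ms = (end - start) * 1000 / FS
                if min_ms <= length_ms <= max_ms:  # keep only 260 ms - 2.7 s segments
                    segments.append((start, end))
                start = None
    return segments
```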

D. Self-Correlation Analysis

When the user is walking or jogging, the sensor readings are periodic and self-correlated at the frequency of walking or jogging. Once the user performs a gesture while walking or jogging, the sensor readings are neither periodic nor self-correlated. Based on this observation, we propose a novel self-correlation metric SC to measure the self-correlation of the accelerometer readings as

SC(t) = \sum_{i \in \{x,y,z\}} \sum_{j=1}^{T} \left[ Acc_i(t+j) - Acc_i(t+j-T-1) \right] / T, \quad (5)

where Acc_i (i \in \{x, y, z\}) are the accelerometer readings of the X-axis, Y-axis, and Z-axis, T is the cycle of the walking or jogging, which is computed as the inverse of the dominant frequency, and t is the current time. If the accelerometer readings are self-correlated at the dominant frequency, the SC is very close to zero. If the accelerometer readings are not self-correlated at the dominant frequency, the SC is either a large positive value or a large negative value. Fig. 5(a) shows the accelerometer readings of a Left gesture when a user is walking. The computed SC curve is shown in Fig. 5(b).

Fig. 5. Segmentation when a user is moving: (a) acceleration data of a Left gesture while walking (walking, raise hand, gesture, put down hand); (b) the SC curve, whose peak and valley bound the segment; (c) the segmented acceleration data.

From Fig. 5(a) and (b), we find that the SC is very close to zero when the user swings his/her hand during walking. The SC begins to increase when the user raises his/her hand. The peak of the SC occurs when the user finishes raising his/her hand and begins to perform the Left gesture. The valley of the SC occurs when the user finishes performing the Left gesture and begins to put down his/her hand. After the user puts down the hand and continues to swing the hand, the SC goes back to zero. We find that the peak and the valley of the SC curve are good indicators of the starting point and the end point of the Left gesture. The reason is that when the user raises or puts down his/her hand, the orientation of the hand changes a lot, which greatly increases or reduces the SC metric.

When the user is walking, the hand points to the ground. Impacted by the force of gravity, the accelerometer readings of the Y-axis are always negative. When the user is jogging, the hand points forward with the palm facing towards the user. In this case, the accelerometer readings of the X-axis are impacted by gravity and always have negative values. However, when the user raises his/her hand and performs the gesture, the palm faces the ground, so the accelerometer readings of the Z-axis are impacted by gravity and have positive values. Therefore, when the user is walking or jogging, the sum of the accelerometer readings over the three axes is always negative. When the user raises his/her hand, the sum of the accelerometer readings increases and reaches its maximum right after the hand is raised. As the SC is computed from the difference of the accelerometer readings in two adjacent time windows (the window size is the cycle of walking or jogging), the SC reaches its peak right after the hand is raised. Similarly, the SC reaches its valley right after the hand is put down. Therefore, we use the peak and the valley of the SC curve as the starting point and the end point of the Left gesture.
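For reference, a direct translation of Eq. (5) into code. The cycle T is derived from the dominant frequency as described in Section IV-A; the 50 Hz sampling rate and the loop form are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np

FS = 50  # sampling rate (Hz)

def sc_curve(acc, dominant_freq):
    """Self-correlation metric SC(t) from Eq. (5).

    acc: (N, 3) accelerometer readings; dominant_freq: walking/jogging
    frequency in Hz taken from the FFT. Returns an array of SC values.
    """
    T = max(1, int(round(FS / dominant_freq)))   # cycle length in samples
    n = len(acc)
    sc = np.zeros(n)
    for t in range(T, n - T):
        # Difference between the window [t+1, t+T] and the previous cycle [t-T, t-1],
        # summed over the three axes and averaged over the cycle length T.
        diff = acc[t + 1:t + T + 1] - acc[t - T:t]
        sc[t] = diff.sum() / T
    return sc
```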

E. Moving Segmentation

We segment the hand gestures when the user is moving by searching for the peak and the valley of the SC. We compute the SC metric from the accelerometer readings. If the SC value is larger than 5 m²/sec², we start to search for the peak of the SC within a 4-second time window. Once the peak is found, we regard it as the starting point of the hand gesture and begin to search for the valley of the SC. We define the valley of the SC as the smallest SC within a 4-second time window that is lower than a threshold of -5 m²/sec². Once the SC valley is found, we regard it as the end point of the hand gesture. We extract the accelerometer and gyroscope readings between the starting point and the end point as the segment. As no gesture lasts longer than 2.7 seconds in our dataset, we drop a segment if we cannot find the valley of the SC within 2.7 seconds after the peak of the SC.
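The peak/valley search itself can be sketched as below, under the assumptions of 50 Hz data and the parameters chosen above (SC threshold 5 m²/sec², 4-second search window, 2.7-second maximum gesture length). The helper sc_curve is the one sketched in Section IV-D; the control flow is an illustrative interpretation of the description above, not the original code.

```python
import numpy as np

FS = 50  # sampling rate (Hz)

def segment_moving(acc, dominant_freq, sc_thr=5.0, win_s=4.0, max_gesture_s=2.7):
    """Peak/valley search on the SC curve; returns (start, end) index pairs."""
    sc = sc_curve(acc, dominant_freq)            # sketched in Section IV-D
    win, max_len = int(win_s * FS), int(max_gesture_s * FS)
    segments, t = [], 0
    while t < len(sc):
        if sc[t] > sc_thr:                       # SC exceeds +5: search for the peak
            peak = t + int(np.argmax(sc[t:t + win]))
            valley = peak + int(np.argmin(sc[peak:peak + win]))
            # The valley must fall below -5 and lie within 2.7 s of the peak.
            if sc[valley] < -sc_thr and valley - peak <= max_len:
                segments.append((peak, valley))  # peak = start point, valley = end point
                t = valley + 1
            else:
                t = peak + 1                     # drop the candidate and keep scanning
        else:
            t += 1
    return segments
```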

Fig. 6 and Fig. 7 show the performance of the moving segmentation under different window sizes and different SC thresholds, respectively. Three evaluation metrics are considered: precision, recall, and F-measure. As the window size or the SC threshold increases, the segmentation precision increases and the recall decreases. When the window size is 4 s and the SC threshold is 5 m²/sec², the F-measure reaches its highest value: 93.7%. Therefore, we choose 4 s as the time window size and 5 m²/sec² as the SC threshold to segment the hand gestures when the user is moving.

TABLE III
COMPARISON OF THE GESTURE SEGMENTATION PERFORMANCE

Scenario   | Algorithm          | Precision | Recall | F-Measure
Non-moving | MobiGesture 5-fold | 92.2%     | 93.5%  | 92.8%
Non-moving | MobiGesture LOSO   | 92.6%     | 90.1%  | 91.3%
Non-moving | E-gesture          | 98.0%     | 97.1%  | 97.5%
Moving     | MobiGesture 5-fold | 93.5%     | 94.6%  | 93.7%
Moving     | MobiGesture LOSO   | 94.0%     | 91.2%  | 92.2%
Moving     | E-gesture          | 8.5%      | 18.3%  | 11.3%

F. Performance

We compare our gesture segmentation algorithm with E-gesture [11], which is the state-of-the-art. E-gesture segments the hand gestures based on the amplitude of the gyroscope readings: a hand gesture is triggered if the amplitude of the gyroscope readings is higher than 25 degrees/sec, and the triggered gesture is assumed to have ended if the amplitude of the gyroscope readings stays lower than 25 degrees/sec for 400 ms. Different from E-gesture, we first apply the AdaBoost classifier to classify the body movement into moving or non-moving. Then, we apply two different segmentation algorithms to segment the hand gestures accordingly.

We evaluate the segmentation accuracy by checking the overlap between a segment and a hand gesture: if the middle of a hand gesture lies in a segment, this gesture is correctly segmented by that segment. The performance of MobiGesture and E-gesture is shown in Table III. From the table, we find that MobiGesture performs stably in both moving scenarios; the F-measures in the two moving scenarios are around 92%. E-gesture performs well when the user is not moving. However, it performs poorly when the user is moving: its F-measure is only 11.3%. The possible reasons are: (1) the sensors in their wristband are different from ours; (2) their predefined hand gestures are different from ours.

We change the threshold of E-gesture from 25 degrees/sec to 250 degrees/sec with a step of 25 degrees/sec and evaluate the performance of E-gesture. Fig. 8 shows the F-measure of E-gesture under different thresholds. When the threshold is 175 degrees/sec, the F-measure of E-gesture in the moving scenario is at its highest value: 74.5%. This is still much lower than the accuracy of our moving segmentation algorithm: 93.7% under the 5-fold cross-validation test and 92.2% under the leave-one-subject-out cross-validation test. Therefore, their segmentation algorithm cannot accurately segment the hand gestures when the user is moving, and their solution is not general enough to be extended to new gestures and hardware platforms.

Fig. 6. Segmentation precision, recall, and F-measure under different window sizes.

Fig. 7. Segmentation precision, recall, and F-measure under different SC thresholds.

Fig. 8. F-measure of E-gesture under different thresholds.

Fig. 9. The distribution of the gesture duration in our dataset.

When a user is standing without performing any gesture, the gyroscope readings are close to zero. The amplitude of the gyroscope readings is a good measurement to segment the hand gestures. Both E-gesture and MobiGesture use the amplitude of the gyroscope readings to segment the hand gestures. Therefore, both algorithms perform well when the user is standing. However, when a user is moving, the hand gestures are mixed with the hand swinging motions. With E-gesture, the amplitude of the gyroscope readings cannot differentiate the hand gestures from the hand swinging motions. Therefore, E-gesture performs poorly in the moving scenarios. Instead, we utilize the self-correlation of the sensor readings to segment the hand gestures, which takes the hand swinging motions into consideration. Therefore, the proposed segmentation algorithm accurately distinguishes the hand gestures from the hand swinging motions.

V. DEEP LEARNING CLASSIFICATION

Two approaches are popular for classifying hand gestures. One is to use conventional machine learning classifiers, such as Naive Bayes [22], Random Forest [15], and Support Vector Machines [23]. The other is to use sequential analysis algorithms, such as the Hidden Markov Model (HMM) [11] and Dynamic Time Warping (DTW) [13].

In this paper, we use a 9-layer CNN as the classification algorithm. There are several advantages of the CNN over the other classification approaches. (1) Instead of relying on manually selected features, a CNN is able to automatically learn parameters and features. (2) A CNN is well suited to complex problems; based on our study, we find that it is capable of handling mobility noises and reducing overfitting. (3) A CNN is very fast to run in the inference stage even when the number of classes is very large.

A. Data Scaling

As the 17 predefined hand gestures are different from each other and different users perform hand gestures at different speeds, the duration of each hand gesture is different. Fig. 9 shows the distribution of the gesture duration in our dataset. The maximum gesture duration is 2.7 seconds, the minimum is 260 ms, and the average is 1.2 seconds. As the sampling rate is 50 Hz, each gesture contains 60 sample points on average.

As a CNN model requires input data of the same size, we format the segment data so that each segment has the same size. We apply Cubic Spline Interpolation [24] to rescale the number of sample points of each segment to 60. As 3-axis accelerometer readings and 3-axis gyroscope readings are collected at each sampling instant, 60 × 6 data points are generated after interpolation. This 60 × 6 data matrix is used for classification.

Fig. 10. Architecture and parameter settings of the 9-layer CNN: Input (size 60×6) → Convolution (kernel size 3, 10 kernels, stride [1,1], padding [1,1]) + Batch Normalization + ReLU → Max Pooling (pool size 2, stride [2,2], padding [0,0]) + Dropout (probability 0.6) → Fully Connected → Softmax → Output (size 17).
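As an illustration, the data scaling step maps naturally onto SciPy's cubic spline interpolation; the helper below is a sketch of the idea rather than the authors' code, and it assumes each segment is stored as a (length, 6) array of accelerometer and gyroscope samples.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def rescale_segment(segment, n_points=60):
    """Resample a (length, 6) accelerometer+gyroscope segment to (n_points, 6).

    Each of the six channels is interpolated with a cubic spline so that every
    gesture segment reaches the CNN as a 60 x 6 matrix, regardless of duration.
    """
    segment = np.asarray(segment, dtype=float)
    old_t = np.linspace(0.0, 1.0, num=len(segment))
    new_t = np.linspace(0.0, 1.0, num=n_points)
    return np.column_stack(
        [CubicSpline(old_t, segment[:, c])(new_t) for c in range(segment.shape[1])])
```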

B. Convolutional Neural Network

We design a 9-layer CNN as the classification algorithm. A CNN consists of an input and an output layer, as well as multiple hidden layers. The output of the i-th layer of an n-layer neural network is given by

y^{(i)} = \sigma^{(i)}\left(W^{(i)} x^{(i)} + b^{(i)}\right), \quad (6)

where y^{(i)} is the output, x^{(i)} is the input, \sigma^{(i)} is the activation function, W^{(i)} is the weight matrix, and b^{(i)} is the bias vector [25]. x^{(0)} is the original input, which is a matrix of the accelerometer and gyroscope sensor data. y^{(n)} is the final output, which is one of the 17 predefined hand gestures. The output of the (i-1)-th layer is the input of the i-th layer, i.e., x^{(i)} = y^{(i-1)}.

Fig. 10 shows the architecture and parameter settings of the CNN. It includes the following 9 layers:

1) Input Layer: The input layer is the entrance to the CNN. It provides data for the following layers. After data scaling, we get a 60 × 6 data matrix. This matrix is supplied to the input layer.

2) Convolutional Layer: The convolutional layer divides the input data into multiple regions. For each region, it computes a dot product of the weights and the input, and then adds a bias term. A set of weights that is applied to a region is called a kernel. The kernel moves along the input data vertically and horizontally, repeating the same computation for each region. The step size with which it moves is called a stride. We use ten 3×3 kernels and a stride of 1 in both the vertical and horizontal directions. To preserve the output size of the convolutional layer, we use a padding of 1 in both the vertical and horizontal directions, which adds rows or columns of zeros to the borders of the original input.

3) Batch Normalization Layer: Batch normalization is used to speed up network training, reduce the sensitivity to network initialization, and improve the generalization of the neural network when the training dataset contains data from different users. To take full advantage of batch normalization, we shuffle the training data after each training epoch.

4) ReLU Layer: Convolutional and batch normalization layers are usually followed by a nonlinear activation function. We choose a Rectified Linear Unit (ReLU) as the activation function. It performs a threshold operation on each input, where any input value less than zero is set to zero. ReLU is easy to compute and optimize, and it provides fast and effective training for deep neural networks. It has been shown to be more effective than traditional activations, such as the logistic sigmoid and hyperbolic tangent, and is widely used in CNNs [25].

5) Max-pooling Layer: The max-pooling layer reduces the number of connections to the following layers by down-sampling. It partitions the input into a set of non-overlapping rectangles and, for each rectangle, outputs the maximum. The intuition is that the exact location of a feature is less important than its rough location relative to other features. The pooling layer reduces the number of parameters to be learned in the following layers, and hence reduces overfitting.

6) Dropout Layer: As a fully connected layer occupies most of the parameters, it is prone to overfitting. One method to reduce overfitting is dropout, which randomly removes nodes from a neural network with a given probability; all the incoming and outgoing edges of a dropped-out node are also removed. The dropout probability in our system is 0.6.

7) Fully-connected Layer: The fully-connected layer connects all of its neurons to the neurons in the previous layer, i.e., the dropout layer. It combines all the features learned by the previous layers to classify the input. The size of the output of the fully-connected layer is equal to the number of hand gesture classes, i.e., 17 in our experiments.

8) Softmax Layer: The softmax layer applies a softmax activation function to the input. The softmax activation function normalizes the output of the fully connected layer. The output of the softmax layer consists of positive numbers that sum to one, which can then be used as classification probabilities by the classification layer.

9) Classification Layer: The probabilities returned by the softmax activation function are the input to the classification layer. The classification layer assigns this input to one of the 17 hand gestures and computes the loss function.

As in many other learning systems, the parameters of a CNN model are optimized to minimize the loss function. We apply Stochastic Gradient Descent with Momentum [26] to learn the CNN parameters (weights W and biases b). It updates the parameters of the CNN by taking small steps in the direction of the negative gradient of the loss function:

\theta_{\ell+1} = \theta_\ell - \alpha \nabla E(\theta_\ell) + \gamma (\theta_\ell - \theta_{\ell-1}), \quad (7)

where \theta is the parameter vector, \ell is the iteration index, \alpha is the learning rate, E(\theta) is the loss function, and \gamma is the momentum term [25]. The momentum term \gamma controls the contribution of the previous gradient step to the current iteration. We use a momentum term of 0.9 and a learning rate of 0.03.

Very large weights can cause the weight matrix W to get stuck in a local minimum easily, since gradient descent only makes small changes to the direction of optimization. This eventually makes it hard to explore the weight space, which leads to overfitting. To reduce overfitting, we use L2 regularization, which adds an extra term to the cost function to penalize large weights. The regularized loss function is

E_R(\theta) = E(\theta) + \lambda \Omega(W), \quad (8)

where \lambda is the regularization factor and \Omega(W) = W^T W / 2 is the regularization function. The regularization factor in our system is 0.03.
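To make the architecture and training settings concrete, here is one way the 9-layer network could be expressed in Keras (the framework is an assumption; the paper does not state which toolkit was used). The layer order, the ten 3×3 kernels, pool size 2, dropout probability 0.6, 17 output classes, SGD momentum 0.9, learning rate 0.03, and L2 factor 0.03 follow the description above, while the single-channel 60×6 input shape and the cross-entropy loss are illustrative choices.

```python
import tensorflow as tf

def build_mobigesture_cnn(n_classes=17, l2_factor=0.03):
    """Sketch of the 9-layer CNN: input, convolution, batch normalization,
    ReLU, max pooling, dropout, fully connected, softmax, classification."""
    reg = tf.keras.regularizers.l2(l2_factor)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(60, 6, 1)),                       # scaled 60 x 6 segment
        tf.keras.layers.Conv2D(10, 3, strides=1, padding="same",
                               kernel_regularizer=reg),         # ten 3x3 kernels
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
        tf.keras.layers.Dropout(0.6),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(n_classes, kernel_regularizer=reg),  # fully connected
        tf.keras.layers.Softmax(),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.03, momentum=0.9),
        loss="categorical_crossentropy",    # classification layer / loss function
        metrics=["accuracy"])
    return model
```

Per-epoch shuffling of the training data (Section V-B3) corresponds to the default shuffle=True of Keras model.fit, and one-hot gesture labels are assumed for the categorical cross-entropy loss.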

C. Performance

We apply both 5-fold cross-validation and leave-one-subject-out cross-validation to evaluate the performance of our CNN model. Accuracy is used as the evaluation metric; it is defined as the number of correctly classified instances divided by the number of all testing instances. Under the 5-fold cross-validation test, the accuracies of the gesture classification when the user is standing, walking, and jogging are 92.2%, 90.1%, and 88.6%, respectively. The gesture classification accuracy when the user is jogging is only 3.6% lower than that when the user is standing. Therefore, we conclude that the moving scenarios do not influence the classification accuracy significantly in our system under the 5-fold cross-validation test.

Under the leave-one-subject-out cross-validation test, the accuracies of the gesture classification when the user is standing, walking, and jogging are 86.6%, 84.6%, and 75.1%, respectively. The gesture classification accuracy when the user is jogging is 11.5% lower than that when the user is standing. In contrast to the 5-fold cross-validation test, the gesture classification is heavily influenced by the moving scenarios under the leave-one-subject-out cross-validation test. This is reasonable, as the leave-one-subject-out test brings noises from different body sizes and different ways of performing the same type of hand gesture. The combination of these noises and mobility noises significantly influences the gesture classification performance.

We compare our gesture classification algorithm with E-gesture [11]. E-gesture proposes a Multi-situation HMM model for gesture classification. During training, an HMM model is built and trained for each pair of gesture and mobility situation. During testing, E-gesture computes the Viterbi scores [12] for each of the HMM models, and the best candidate is selected as the classification result. We call this method Multi-situation HMM.


TABLE IV
COMPARISON OF THE GESTURE CLASSIFICATION PERFORMANCE

Test   | Scenario | Algorithm           | Accuracy
5-fold | Standing | Multi-situation HMM | 97.8%
5-fold | Standing | CNN                 | 92.2%
5-fold | Walking  | Multi-situation HMM | 96.2%
5-fold | Walking  | CNN                 | 90.1%
5-fold | Jogging  | Multi-situation HMM | 96.5%
5-fold | Jogging  | CNN                 | 88.6%
LOSO   | Standing | Multi-situation HMM | 70.5%
LOSO   | Standing | CNN                 | 86.6%
LOSO   | Walking  | Multi-situation HMM | 69.3%
LOSO   | Walking  | CNN                 | 84.6%
LOSO   | Jogging  | Multi-situation HMM | 60.7%
LOSO   | Jogging  | CNN                 | 75.1%

TABLE V
COMPARISON OF THE OVERALL PERFORMANCE

Test   | Scenario | Algorithm   | Accuracy
5-fold | Standing | E-Gesture   | 95.8%
5-fold | Standing | MobiGesture | 85.0%
5-fold | Walking  | E-Gesture   | 8.2%
5-fold | Walking  | MobiGesture | 83.1%
5-fold | Jogging  | E-Gesture   | 8.2%
5-fold | Jogging  | MobiGesture | 82.8%
LOSO   | Standing | E-Gesture   | 69.1%
LOSO   | Standing | MobiGesture | 80.2%
LOSO   | Walking  | E-Gesture   | 5.9%
LOSO   | Walking  | MobiGesture | 79.5%
LOSO   | Jogging  | E-Gesture   | 5.2%
LOSO   | Jogging  | MobiGesture | 70.6%

We apply the 5-fold cross-validation test and the leave-one-subject-out cross-validation test to evaluate the gesture classification performance. Table IV shows the gesture classification accuracy of these two algorithms under three different moving scenarios. Under the leave-one-subject-out cross-validation test, CNN is 16.1%, 15.3%, and 14.4% more accurate than Multi-situation HMM when the user is standing, walking, and jogging, respectively. Under the 5-fold cross-validation test, Multi-situation HMM is roughly 7% more accurate than CNN. For Multi-situation HMM, the average accuracy is 96.8% under the 5-fold cross-validation test and 66.8% under the leave-one-subject-out cross-validation test. There is a 30% accuracy difference between these two tests, which shows that the Multi-situation HMM model is overfitted. For CNN, the accuracy difference between these two tests is 8.2%. The reasonably small difference shows that overfitting is significantly reduced.

VI. PERFORMANCE EVALUATION

In this section, we first evaluate the overall performance of MobiGesture, which integrates the aforementioned segmentation and CNN algorithms. We also compare MobiGesture with state-of-the-art work. Then, we evaluate the overhead of MobiGesture and compare it with state-of-the-art work.

A. Accuracy

The overall performance of MobiGesture and E-gesture is shown in Table V. Under the 5-fold cross-validation test, E-gesture performs well when the user is standing. However, when the user is walking or jogging, the accuracy of E-gesture is very low. The reason is that E-gesture cannot differentiate the hand gestures from the hand swinging motions. MobiGesture performs stably under different moving scenarios. The recognition accuracy when the user is jogging is only 2.2% lower than that when the user is standing. Under the leave-one-subject-out cross-validation test, when the user is standing, the accuracy of E-gesture is only 69.1%. It is much lower than the accuracy under 5-fold cross-validation: 95.8%. This shows that the gesture classification model in E-gesture is overfitted. When the user is walking or running, E-gesture again performs poorly due to the low segmentation accuracy. The accuracy of MobiGesture under the leave-one-subject-out cross-validation test is 3.6% ∼ 12.2% lower than that under the 5-fold cross-validation test. The reasonably small difference shows the effectiveness of MobiGesture's anti-overfitting design.

TABLE VI
COMPARISON OF THE TIME CONSUMPTION

Algorithm           | Training Time (s) | Testing Time (ms)
Multi-situation HMM | 89.6              | 40.8
CNN                 | 56.7              | 13.8

B. Time Delay

Table VI shows the time consumption of training and testing for Multi-situation HMM and CNN. For the training time, Multi-situation HMM consumes 58% more time than CNN. For the testing time, Multi-situation HMM consumes roughly three times as much as CNN. Multi-situation HMM trains an HMM model for each pair of hand gesture and moving scenario, while MobiGesture only trains one CNN model for all the hand gestures and moving scenarios. As the number of moving scenarios increases, Multi-situation HMM consumes more time for testing, while the testing time of our CNN stays the same. Therefore, CNN is more practical than Multi-situation HMM for real-time classification.

VII. RELATED WORK

As far as we know, two efforts have been put forth to study the gesture recognition problem when the user is moving. Park et al. [11] propose a gesture recognition system with a hand-worn sensor and a mobile device. To segment hand gestures, they design a threshold-based closed-loop collaborative segmentation algorithm. It automatically adjusts the threshold according to four mobility situations: RIDE, STAND, WALK, and RUN. To recognize hand gestures, they propose a Multi-situation HMM architecture. There are several limitations in their system. For gesture segmentation, their threshold-based segmentation algorithm cannot effectively differentiate the predefined hand gestures from the hand swinging motions in our dataset. For gesture recognition, they train an HMM model for each pair of hand gesture and mobility situation. In total, 32 HMM models are trained in their system. As the number of hand gestures or the number of mobility situations increases, their computational cost increases dramatically. Different from this work, we only train one CNN model, which consumes much less computational power and time.


Additionally, evaluation results show that our CNN model performs better than the Multi-situation HMM model under the leave-one-subject-out cross-validation test. The second work comes from Murao et al. [13]. They propose a combined-activity recognition system. This system first classifies user activity into one of three categories: postures, behaviors, and gestures. Then DTW is applied to recognize hand gestures for the specific category. However, their system requires five sensors attached to the human body to recognize activity. Instead, we only use one sensor.

Inertial sensor-based gesture recognition has been widely studied in mobile and pervasive computing, and various approaches dealing with the recognition of gestures or events have been presented. RisQ [15] applies motion sensors on a wristband to recognize smoking gestures. Bite Counter [27] utilizes a watch-like device with a gyroscope to detect and record when an individual takes a bite of food. Porzi et al. [23] propose a smart watch-based gesture recognition system for assisting people with visual impairments. Xu et al. classify hand/finger gestures and written characters from smart watch motion sensor data [22]. FingerPad [28], uTrack [29], and Finexus [30] use magnetic sensors to recognize finger gestures. However, none of these efforts takes body movement, such as walking and jogging, into consideration.

VIII. CONCLUSION

In this paper, we present MobiGesture, a mobility-aware gesture recognition system for healthcare. We present a novel mobility-aware gesture segmentation algorithm to detect and segment hand gestures. In addition, we design a CNN model to classify the hand gestures with mobility noises. Evaluation results show that the proposed CNN is 16.1%, 15.3%, and 14.4% more accurate than state-of-the-art work when the user is standing, walking, and jogging, respectively. The proposed CNN is also two times faster than state-of-the-art work.

ACKNOWLEDGMENTS

Special thanks to our participants in the user studies and all the anonymous reviewers. This work was supported by NSF CNS-1253506 (CAREER).

REFERENCES

[1] M. C. Ashe, W. C. Miller, J. J. Eng, and L. Noreau, "Older adults, chronic disease and leisure-time physical activity," Gerontology, vol. 55, no. 1, pp. 64–72, 2009.
[2] I.-M. Lee and D. M. Buchner, "The importance of walking to public health," Medicine and Science in Sports and Exercise, vol. 40, no. 7 Suppl, pp. S512–8, 2008.
[3] T. Prohaska, E. Belansky, B. Belza, D. Buchner, V. Marshall, K. McTigue, W. Satariano, and S. Wilcox, "Physical activity, public health, and aging: critical issues and research priorities," The Journals of Gerontology Series B: Psychological Sciences and Social Sciences, vol. 61, no. 5, pp. S267–S273, 2006.
[4] E. M. Simonsick, J. M. Guralnik, S. Volpato, J. Balfour, and L. P. Fried, "Just get out the door! Importance of walking outside the home for maintaining mobility: findings from the Women's Health and Aging Study," Journal of the American Geriatrics Society, vol. 53, no. 2, pp. 198–203, 2005.
[5] S. E. Hardy, Y. Kang, S. A. Studenski, and H. B. Degenholtz, "Ability to walk 1/4 mile predicts subsequent disability, mortality, and health care costs," Journal of General Internal Medicine, vol. 26, no. 2, pp. 130–135, 2011.
[6] Global recommendations on physical activity for health. World Health Organization. [Online]. Available: http://www.who.int/dietphysicalactivity/factsheet recommendations/en
[7] Physical activity guidelines for Americans. U.S. Department of Health and Human Services. [Online]. Available: http://health.gov/paguidelines
[8] Nike+ Run Club app. [Online]. Available: https://www.nike.com/us/en -us/c/nike-plus/running-app-gps
[9] RunKeeper app. [Online]. Available: https://runkeeper.com/
[10] MapMyRun app. [Online]. Available: http://www.mapmyrun.com/
[11] T. Park, J. Lee, I. Hwang, C. Yoo, L. Nachman, and J. Song, "E-gesture: a collaborative architecture for energy-efficient gesture recognition with hand-worn sensor and mobile devices," in Proceedings of the ACM SenSys. ACM, 2011, pp. 260–273.
[12] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," in The Foundations of the Digital Wireless World: Selected Works of A. J. Viterbi. World Scientific, 2010, pp. 41–50.
[13] K. Murao and T. Terada, "A recognition method for combined activities with accelerometers," in Proceedings of the ACM UbiComp. ACM, 2014, pp. 787–796.
[14] H. Junker, O. Amft, P. Lukowicz, and G. Troster, "Gesture spotting with body-worn inertial sensors to detect user activities," Pattern Recognition, vol. 41, no. 6, pp. 2010–2024, 2008.
[15] A. Parate, M.-C. Chiu, C. Chadowitz, D. Ganesan, and E. Kalogerakis, "RisQ: Recognizing smoking gestures with inertial sensors on a wristband," in Proceedings of the ACM MobiSys. ACM, 2014, pp. 149–161.
[16] H. Zhao, S. Wang, G. Zhou, and D. Zhang, "Ultigesture: A wristband-based platform for continuous gesture control in healthcare," Smart Health, 2018.
[17] P. Alfeld, "A trivariate Clough-Tocher scheme for tetrahedral data," Computer Aided Geometric Design, vol. 1, no. 2, pp. 169–181, 1984.
[18] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA data mining software: an update," Proceedings of the ACM SIGKDD, vol. 11, no. 1, pp. 10–18, 2009.
[19] J. R. Quinlan, C4.5: Programs for Machine Learning. Elsevier, 2014.
[20] W.-C. Bang, W. Chang, K.-H. Kang, E.-S. Choi, A. Potanin, and D.-Y. Kim, "Self-contained spatial input device for wearable computers," in Proceedings of the IEEE ISWC. IEEE Computer Society, 2003, p. 26.
[21] A. Y. Benbasat and J. A. Paradiso, "An inertial measurement framework for gesture recognition and applications," in International Gesture Workshop. Springer, 2001, pp. 9–20.
[22] C. Xu, P. H. Pathak, and P. Mohapatra, "Finger-writing with smartwatch: A case for finger and hand gesture recognition using smartwatch," in Proceedings of the ACM HotMobile. ACM, 2015, pp. 9–14.
[23] L. Porzi, S. Messelodi, C. M. Modena, and E. Ricci, "A smart watch-based gesture recognition system for assisting people with visual impairments," in Proceedings of the ACM IMMPD. ACM, 2013, pp. 19–24.
[24] C. De Boor, A Practical Guide to Splines. Springer, 1978, vol. 27.
[25] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Cambridge, 2016, vol. 1.
[26] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT. Springer, 2010, pp. 177–186.
[27] Y. Dong, A. Hoover, J. Scisco, and E. Muth, "A new method for measuring meal intake in humans via automated wrist motion tracking," Applied Psychophysiology and Biofeedback, vol. 37, no. 3, pp. 205–215, 2012.
[28] L. Chan, R.-H. Liang, M.-C. Tsai, K.-Y. Cheng, C.-H. Su, M. Y. Chen, W.-H. Cheng, and B.-Y. Chen, "FingerPad: private and subtle interaction using fingertips," in Proceedings of the ACM UIST. ACM, 2013, pp. 255–260.
[29] K.-Y. Chen, K. Lyons, S. White, and S. Patel, "uTrack: 3D input using two magnetic sensors," in Proceedings of the ACM UIST. ACM, 2013, pp. 237–244.
[30] K.-Y. Chen, S. N. Patel, and S. Keller, "Finexus: Tracking precise motions of multiple fingertips using magnetic sensing," in Proceedings of the ACM CHI. ACM, 2016, pp. 1504–1514.

