
46

Automatic Segmentation and Recognition in Body Sensor Networks Using a Hidden Markov Model

ERIC GUENTERBERG, HASSAN GHASEMZADEH, and ROOZBEH JAFARI, University of Texas at Dallas

One important application of body sensor networks is action recognition. Action recognition often implicitly requires partitioning sensor data into intervals, then labeling the partitions according to the action that each represents or as a non-action. The temporal partitioning stage is called segmentation, and the labeling is called classification. While many effective methods exist for classification, segmentation remains problematic. We present a technique inspired by continuous speech recognition that combines segmentation and classification using hidden Markov models. This technique is distributed across several sensor nodes. We show the results of this technique and the bandwidth savings over full data transmission.

Categories and Subject Descriptors: I.5.1 [Pattern Recognition]: Models—Statistical; I.5.3 [Pattern Recognition]: Clustering—Algorithms; C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems—Real-time and embedded systems

General Terms: Algorithms, Experimentation, Performance

Additional Key Words and Phrases: Body sensor networks, hidden Markov models, action recognition

ACM Reference Format: Guenterberg, E., Ghasemzadeh, H., and Jafari, R. 2012. Automatic segmentation and recognition in body sensor networks using a hidden Markov model. ACM Trans. Embed. Comput. Syst. 11, S2, Article 46 (August 2012), 19 pages. DOI = 10.1145/2331147.2331156 http://doi.acm.org/10.1145/2331147.2331156

1. INTRODUCTION

The capabilities of small electronic devices have been increasing exponentially as their sizes and prices have dropped. Uses that may have seemed frivolous or expensive are becoming practical and even cheap. For instance, cell phones can now record videos and images and transmit them wirelessly to personal websites in real time, and cars can automatically notify paramedics of a crash. One exciting platform with similar potential is the body sensor network (BSN), in which several intelligent sensing devices are placed on the human body and can perform collaborative sensing and signal processing for various applications.

Currently, these sensing devices are large enough that they are too cumbersome for casual use. However, the threshold for wearability depends on the application. For instance, stride variability is associated with the occurrence of Alzheimer’s [Hausdorff et al. 1998; Sheridan et al. 2003]. If a patient could wear a sensor on his leg that could help a doctor evaluate the effectiveness of medication in a naturalistic setting, the

Author’s address: R. Jafari, ESSP Lab, University of Texas at Dallas, 800 W. Campbell Rd, MS EC33, Richardson, TX 75080; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from the Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701, USA, fax +1 (212) 869-0481, or [email protected]. © 2012 ACM 1539-9087/2012/08-ART46 $15.00

DOI 10.1145/2331147.2331156 http://doi.acm.org/10.1145/2331147.2331156

ACM Transactions on Embedded Computing Systems, Vol. 11, No. S2, Article 46, Publication date: August 2012.


46:2 E. Guenterberg et al.

inconvenience might be worth it. Further, these devices are getting smaller and more powerful every year, so wearability is unlikely to remain a long-term problem.

Therefore, now is the time to investigate applications so that hardware designers can optimize their devices for more useful applications. One use of BSNs is action recognition, in which the actions of the person wearing the sensors are identified. This has several applications. For instance, techniques exist for extracting stride variability, but the output is only correct if the person is walking [Aminian et al. 2002]. Also, action recognition could be used to develop an activity log for a person to help him or doctors assess health [Choudhury et al. 2008; Nait-Charif and McKenna 2004; Ouchi et al. 2004] or avoid dangerous actions, which might be useful for RSI sufferers. Action recognition could even be used to help provide contextual interfaces to other devices [Castelli et al. 2007].

Recognition is a two-step process: (1) locate temporal regions that contain a single action, and (2) recognize the action. Both these problems are well studied for image and motion capture systems. However, many of these techniques depend on spatial information provided by those systems and are not appropriate to resource-constrained environments. This has led BSN researchers to adapt techniques from more basic pattern-recognition approaches and from speech recognition. Much of the work in the BSN community has focused on the recognition portion, while segmentation in BSN recognition is still largely an open problem. Popular approaches include segmenting on fixed time slices, manual segmentation, and exhaustive search. Fixed time slices may capture multiple movements or only part of a movement, manual segmentation is impractical for a deployed system, and exhaustive techniques are resource intensive.

In previous work, we looked at segmentation using an efficient energy-based approach [Guenterberg et al. 2009]. We found that the method was effective for actions separated by rests. However, in a non-lab setting, actions often occur in rapid succession with no rest between them. This problem is called continuous action recognition and is addressed by applying a hidden Markov model (HMM)-based approach adapted from the speech recognition community. The HMM can both segment and recognize simultaneously. The novelty of our work is a system which (a) provides continuous action recognition from inertial sensor data, (b) uses unsupervised clustering on a per-node basis to reduce the communication from each node, and (c) provides postural information to eliminate impossible actions (e.g., walking from a sitting position). The hidden Markov model allows full signal segmentation of continuous actions with a low computational-order algorithm. To the best of our knowledge, this capability is not addressed in the literature. Since the sole goal of segmentation is to facilitate recognition, the quality of segmentation will be judged entirely by presenting recognition accuracy. Further, the results will be compared to a k-NN classifier provided with manual segmentation as a conceptual upper bound.

2. RELATED WORK

Several approaches to action recognition have been proposed. A common problem is not the recognition but the segmentation of data into actions. Often in image recognition this is done without specific knowledge of what the image contains, but for action recognition using inertial sensors, it is generally not possible to infer a segmentation without some knowledge of what is being segmented. Various approaches address this problem in different ways.

Ward et al. recognized several workshop activities, such as taking wood out of a drawer, putting it into the vice, getting out a hammer, and more. They avoided the problem of segmenting accelerometer data by segmenting the data using the presence or absence of sound and then identified the action using accelerometer data and an



Automatic Segmentation and Recognition in Body Sensor Networks 46:3

HMM classifier [Ward et al. 2006]. Their results showed the effectiveness of this technique for a shop, but in many other situations, actions are not correlated with sounds.

Another approach used a k-NN classifier and several statistical features to classify actions using a minimum number of sensor nodes [Ghasemzadeh et al. 2008]. Manual segmentation was used to avoid introducing errors from segmentation. This is a good technique for isolating the performance of various parts of a system, but for a deployed system, a satisfactory segmentation scheme is necessary.

Alternatively, it is possible to try a number of segmentations and choose the best. Lv and Nevatia [2006] use 3-D motion capture data. Given a start and end time, each joint uses an HMM to identify the action. AdaBoost is then used to make a global decision using the HMMs as weak classifiers. A dynamic programming algorithm chooses the best segmentation according to their maximum-likelihood function in O(T³) time. This scheme performs well if all computation is done on a single machine, but when each HMM is employed on a separate sensor node, the communication overhead required to try the different segmentations is quite high.

Several authors classify fixed-size segments independently of each other [Bao and Intille 2004]. This can result in outliers and discontinuities. Many methods involve some sort of smoothing function [Bao 2003; Courses et al. 2008; Van Laerhoven and Gellersen 2004]. One such method uses AdaBoost to enhance several single-feature weak classifiers. An HMM uses the confidence output of the AdaBoost classifier as input. A separate HMM is trained for each action class, and the overall segmentation/classification is chosen based on the maximum likelihood among the various HMMs [Lester et al. 2005].

This is somewhat similar to our approach, except our model is based on a single HMM, which allows us to rule out impossible sequences of actions and to avoid outliers that could result from one model temporarily having higher probability than the others. The main contribution of our algorithm is efficiently producing a segmentation and classification and performing this processing on a distributed platform.

Quwaider and Biswas [2008] divide actions, which they refer to as postures, based on the activity level measured with accelerometers. With high-activity postures, such as running, the postures are identified based on the energy level on each limb. For relatively quiet postures, such as sitting and standing, they employ a hidden Markov model on radio signal strength (RSSI) differences between sensor nodes. With this, they can differentiate between sitting and standing postures. Our technique also identifies postures as key to recognizing actions, but we use an inertial sensor approach to achieve recognition and explicitly model actions as sequences of motions. This allows us to differentiate between actions that are different but may have a similar level of activity, such as turning counterclockwise and turning clockwise.

Another HMM-based segmentation technique from our research group subdivides a single action [Guenterberg et al. 2009]. This method is able to find the time of certain key events within a known and possibly repeating action but is unable to determine the action. The present article describes a method for segmenting and identifying actions from a stream of sensor data. Therefore, these works are complementary but different in terms of both goals and methods. The method described in Guenterberg et al. [2009] is based on a left-right HMM and is able to determine the time when certain events occur within an action, such as heel touch and toe raise during walking. While both that work and the present work rely on HMMs to process inertial data of human movement, the present work uses a unique state structure built from posture states and independently trained left-right models for each movement. Further, we introduce the concept of a node-level unsupervised clustering technique, called motion transcripts, to reduce communication load and model order. Finally, this work uses data from multiple sensor nodes for most action recognition problems, while the




Fig. 1. Sensor placement.

Fig. 2. Sensor node.

previous work was aimed at data from a single sensor location (even though the possibility of multiple locations was explored).

3. DATA COLLECTION HARDWARE

This article presents a scenario in which the actions a subject performs are identified from continuous data provided by a BSN. The sensor nodes are embedded computing and sensing platforms with inertial sensors, wireless communication capability, a battery, and limited processing capabilities. Sensor nodes must be placed at multiple locations on the body to capture sufficient information to accurately determine the action. For instance, the actions placing something on a shelf and standing still produce similar sensor data on the leg but different data on the arms, while turning to look behind and turning 90° appear similar at the shoulder but differ at the legs. The nodes communicate with a base station where a final conclusion is reached.

3.1. Sensing Hardware and Body Placement

Figure 2 shows one of the sensor nodes used to collect data for this article. The sensor nodes use the commercially available TelosB mote with a custom-designed sensor board and are powered by two AA batteries. The processor is a 16-bit, 4 MHz TI MSP430. The sensor board includes a tri-axial accelerometer and a bi-axial gyroscope. Data is collected from each sensor at 20 Hz. This frequency was chosen empirically as a compromise between sampling rate and packet loss.

The sensor nodes are placed on the body, as shown in Figure 1. Placement was chosen so that each major body segment is monitored with a sensor. While we expect that nodes placed at a subset of these locations would be sufficient for accurate classification of all considered actions, no formal procedure was performed to select such a reduced set. Discovering such procedures could prove to be a fertile area for future research.

3.2. Constraints and Deployment Architecture

The goal of this research was to find a computationally realistic algorithm for segmenting and classifying actions. Actually implementing this technique on sensor nodes is the subject of future research. To this end, data collected on each sensor node was broadcast to a base station and recorded for later processing in the MATLAB environment. This gave us the most flexibility for developing and testing different signal processing and classification schemes.

The algorithms presented here assume the following deployment architecture: the sensor nodes are placed on the body, as shown in Figure 1. Each can communicate




Table I. Actions Captured

ID   Initial Posture   Action                                                Final Posture
1    Stand             Stand to Sit (Armchair)                               Sit
2    Sit               Sit to Stand (Armchair)                               Stand
3    Stand             Stand to Sit (Dining Chair)                           Sit
4    Sit               Sit to Stand (Dining Chair)                           Stand
5    Sit               Sit to Lie                                            Lie
6    Lie               Lie to Sit                                            Sit
7    Stand             Bend and Grasp from Ground (R Hand)                   Stand
8    Stand             Bend and Grasp from Ground (L Hand)                   Stand
9    Stand             Bend and Grasp from Coffee Table (R Hand)             Stand
10   Stand             Bend and Grasp from Coffee Table (L Hand)             Stand
11a  Stand             Turn Clockwise 90°                                    Stand
11b  Stand             Return from 11a                                       Stand
12a  Stand             Turn Counterclockwise 90°                             Stand
12b  Stand             Return from 12a                                       Stand
13   Stand             Look Back Clockwise and Return                        Stand
14   Stand             Look Back Counterclockwise and Return                 Stand
15a  Stand             Kneeling (R Leg First)                                Kneel
15b  Kneel             Return from 15a                                       Stand
16a  Stand             Kneeling (L Leg First)                                Kneel
16b  Kneel             Return from 16a                                       Stand
17a  Stand             Move Forward 1 Step (R Leg)                           Stand
17b  Stand             Move L Leg beside R Leg                               Stand
18a  Stand             Move Forward 1 Step (L Leg)                           Stand
18b  Stand             Move R Leg beside L Leg                               Stand
19a  Stand             Reach up to Cabinet (R Hand)                          Stand
19b  Stand             Return from 19a                                       Stand
20a  Stand             Reach up to Cabinet (L Hand)                          Stand
20b  Stand             Return from 20a                                       Stand
21a  Stand             Reach up to Cabinet (Both Hands)                      Stand
21b  Stand             Return from 21a                                       Stand
22   Stand             Grasp an Object (1 Hand), Turn 90° and Release        Stand
23   Stand             Grasp an Object (Both Hands), Turn 90° and Release    Stand
24   Stand             Turn Clockwise 360°                                   Stand
25   Stand             Turn Counterclockwise 360°                            Stand

directly with the base station. The nodes have a limited power supply and must last a long time between recharges, so power must be conserved. The base station is a cell phone or PDA that has greater processing capabilities and can use significantly more power. Wherever the final classification occurs, it must be transmitted to the base station for storage or long-range communication.

Communication uses significantly more power than processing [Akyildiz et al. 2002; Polastre et al. 2005], so limiting communication is key to conserving power. Also, while the base station is more powerful than a sensor node, it is not as powerful as a desktop computer, so algorithms designed to run on the base station should be of low computational order.

3.3. Actions Collected

The actions considered are mostly transitional actions. Each starts and ends with a posture. The actions are shown in Table I. Not all postures sharing the same label are exactly the same. For instance, for stand at the end of actions 19a, 20a, and 21a, one or both hands are on a shelf, whereas for most other cases stand means standing with hands resting at the side. The probabilistic nature of HMMs allows both postures to be represented by the same state.




The actions were collected from three subjects, each of whom performed each action ten times. These actions were manually segmented to label the start and the end of each action. This manual segmentation is used to create sequences of known actions to train the model and represents the ground truth. These are referred to as canonical annotations. The data is divided into a training set and a testing set, with approximately half the trials used for training and half for testing. The testing data has also been manually annotated to allow for comparison between the segmentation and labeling automatically generated by our system and the ground truth (manual segmentation).

4. CLASSIFICATION MODEL

One of the most difficult problems in classification is trying to label all the actions in a continuous stream of data in which both the timing of actions and the labels are unknown. This problem has been considered many times in speech recognition and is called continuous speech recognition [Rabiner and Juang 1986]. The problem of speech recognition is similar enough to action recognition that many techniques used for speech recognition can be applied, with appropriate modifications, to action recognition tasks. Jurafsky et al. [2000] present a model based on hidden Markov models (HMMs) where each word is represented by a separate left-right HMM.¹ These are combined into a single HMM by creating a null state which generates no output. Each word starts from this null state and ends on the null state. A very similar approach is used for gesture recognition from hand sensors in Lee and Kim [1999].

We took this model and adapted it to fit within the constraints imposed by our BSN configuration and to more effectively solve the problem of action recognition. One particular change is that each action is assumed to start with some posture, such as kneeling or standing, and end with a posture. The postures are not null states; that is, there is an output associated with a posture, and a posture may persist for a period of time. The input for this system is a subject performing movements. These movements can be in an arbitrary order. The output is a segmentation and a set of labels for each segment.
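As a concrete illustration, the combined state space can be sketched as posture states plus one left-right chain per action. The posture names, action set, chain lengths, and transition probabilities below are illustrative placeholders, not the trained model from this article:

```python
import numpy as np

# Illustrative combined HMM state space: persistent posture states plus a
# left-right chain of states per action (placeholder values, not the paper's).
postures = ["Stand", "Sit"]
actions = {"Stand to Sit": ("Stand", "Sit", 3),   # (start, end, chain length)
           "Sit to Stand": ("Sit", "Stand", 3)}

states = list(postures)
for name, (_, _, mw) in actions.items():
    states += [f"{name}/{k}" for k in range(mw)]
idx = {s: i for i, s in enumerate(states)}

A = np.zeros((len(states), len(states)))
for q in postures:                        # postures can persist over time
    A[idx[q], idx[q]] = 0.5
for name, (start, end, mw) in actions.items():
    chain = [idx[f"{name}/{k}"] for k in range(mw)]
    A[idx[start], chain[0]] = 0.25        # leave a posture into an action
    for a, b in zip(chain, chain[1:]):    # left-right: self or next state only
        A[a, a], A[a, b] = 0.5, 0.5
    A[chain[-1], chain[-1]] = 0.5
    A[chain[-1], idx[end]] = 0.5          # action ends on its final posture
A /= A.sum(axis=1, keepdims=True)         # normalize rows into probabilities
```

Every row of `A` is a valid transition distribution, and impossible sequences (e.g., entering an action from the wrong posture) have zero probability by construction.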

4.1. Overview

Classification requires a number of signal processing steps that execute on the sensor nodes and the base station, as shown in Figure 3(a). The system is designed to accurately classify actions with limited communication and use of processing power. The data is processed on a moving window centered on the current sample. The window moves forward one sample at a time.
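This per-sample moving-window processing can be sketched as follows; the feature names match the list in this section, but the exact formulas (e.g., finite-difference derivatives) are our assumptions:

```python
import numpy as np

def window_features(signal, t, w=5):
    """Feature vector for the w-sample window centered on sample t of one
    sensor axis: mean, standard deviation, rms, and first and second
    derivatives (approximated by finite differences; an assumed definition)."""
    half = w // 2
    win = signal[max(0, t - half):t + half + 1]
    d1 = np.gradient(win)                       # first derivative estimate
    return np.array([win.mean(), win.std(),
                     np.sqrt(np.mean(win ** 2)),
                     d1.mean(), np.gradient(d1).mean()])

# One feature vector per sample; a real node computes this for all five axes
sig = np.sin(np.linspace(0, 3, 40))             # stand-in for one axis at 20 Hz
feats = np.vstack([window_features(sig, t) for t in range(len(sig))])
```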

(1) Sensor Data. Data from five sensors is collected. The accelerometer senses three axes of acceleration: a_x, a_y, and a_z. The gyroscope senses angular velocity in two axes: θ and φ. There is no angular velocity from the axis orthogonal to the plane of the sensor board.

(2) Feature Extraction. For each sample time, a feature vector is generated. The following features are extracted from a five-sample window for each sensor: mean, standard deviation, rms, first derivative, and second derivative.

¹Hidden Markov models assume a process starting in a state which, at each discrete time, generates an output and then transitions to a new state. The state transition and output are probabilistic and based exclusively on the current state. The output can be observed but not the state. Algorithms exist to train a model to a given set of output sequences and to infer the state sequence given an output sequence and model. A left-right model restricts transitions to self transitions or the next state in sequence. See Rabiner and Juang [1986] for more information.




Fig. 3. Signal processing and recognition models for continuous action recognition.

(3) Transcript Generation. Instead of transmitting the feature vector to the base station, each sensor node labels a sample using a single character from a small alphabet. Each sensor has a unique alphabet with between two and nine characters. Characters often repeat for several samples, allowing for significant compression. The sequences of labels produced are motion transcripts. Transcript generation uses Gaussian mixture models (GMMs) to label samples based on clusters generated from the training data.

(4) Hidden Markov Model. The HMM uses the model shown in Figure 3(b). In the middle are actions, which are modeled as left-right HMMs with between M_w = 1 and 10 states. The postures on the left and right are each modeled using a single state. The duplicated postures represent the same state. Postures and actions are connected as shown to form a single HMM.

(5) Generating Output. When trying to segment and classify data, the Viterbi algorithm [Rabiner and Juang 1986] is used to find the most likely state sequence for the given output. The Viterbi algorithm is used because it finds the optimal sequence efficiently. Each sample is labeled with the name of the action or posture of the associated state. This output is the generated annotations.
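The decoding step (5) is the standard Viterbi recursion; the toy two-state model below is illustrative, not the trained BSN model:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi, obs):
    """Most likely state path for a discrete symbol sequence `obs`,
    given log transition (log_A), emission (log_B), and initial (log_pi)
    probabilities. Standard Viterbi dynamic programming."""
    N, T = log_A.shape[0], len(obs)
    delta = log_pi + log_B[:, obs[0]]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: best i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta.argmax())]                 # backtrack from the best end
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy 2-state model over a 2-symbol transcript alphabet
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
log_B = np.log([[0.8, 0.2], [0.2, 0.8]])
log_pi = np.log([0.5, 0.5])
path = viterbi(log_A, log_B, log_pi, [0, 0, 1, 1, 1])   # -> [0, 0, 1, 1, 1]
```

The recursion is O(T·N²), which is what makes the single-pass, low computational-order decoding described above feasible on a modest base station.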

4.2. Transcript Generation

Reducing data before transmission can save considerable power in a BSN. Transcripts do this by reducing the multidimensional per-sample observations on a sensor node to a single character taken from a small alphabet. Transcripts are inspired by the idea that actions can be represented by a sequence of motions. Motions can be identified from a small interval of observations. A single motion or position is likely to persist for some time, allowing run-length encoding to further reduce transmitted data.
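The run-length compression of a transcript is straightforward; this is a sketch, since the article does not specify the exact wire format:

```python
from itertools import groupby

def rle(transcript):
    """Run-length encode a per-sample transcript string."""
    return [(ch, len(list(run))) for ch, run in groupby(transcript)]

def rld(pairs):
    """Decode back to the per-sample transcript."""
    return "".join(ch * n for ch, n in pairs)

pairs = rle("aaaabbbcccccd")   # -> [('a', 4), ('b', 3), ('c', 5), ('d', 1)]
```

Thirteen per-sample symbols become four (symbol, count) pairs here; for motions that persist over many 20 Hz samples, the savings grow accordingly.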

We have no canonical list of motions; therefore, a technique is needed that does not require human input. One solution is unsupervised clustering, which automatically groups points based on underlying patterns in the data. Once these groups are created from training data, later observations may be assigned to one of the existing groups. In our system, the points are feature vectors in F-dimensional space.

The most common clustering techniques include hierarchical clustering [Johnson 1967], k-means clustering [Hartigan and Wong 1979], and model-based clustering [Figueiredo and Jain 2002; Fraley and Raftery 1998]. Model-based clustering assumes that all points have been generated from a set of distributions. The Gaussian mixture model (GMM) is a model-based clustering using Gaussian distributions. Many realistic processes actually generate output based on Gaussian distributions, and many




more can be approximated by a small number of Gaussian distributions. This causes GMMs to often outperform other methods of clustering [Fraley and Raftery 1998]. For these reasons, GMMs are used for transcript generation. We assume a diagonal covariance matrix to reduce computational complexity for labeling and to alleviate errors in estimation due to the small sample size problem [Raudys and Jain 1991].

There is an independent transcript, T_i, for every node. Each frame of data from the node can be labeled with one of the C_i symbols, each of which is represented by a GMM, λ_j. λ_j has M_j Gaussian distributions. As shown in Equation (1), each mixture is represented by three parameters: the mixing parameters p, which represent the prior probability of the distribution generating a given point, and μ and Σ, the Gaussian mean and covariance matrices.

λ_j = { (p_1, μ_1, Σ_1), … , (p_{M_j}, μ_{M_j}, Σ_{M_j}) }.  (1)

The probability that a given observation x was generated by the mixture λ_j is

p(x | λ_j) = ∑_{i=1}^{M_j} p_i p(x | μ_i, Σ_i).  (2)
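Equation (2) can be evaluated directly; the sketch below uses the diagonal covariance assumed in this section, and the two-component mixture is purely illustrative:

```python
import numpy as np

def diag_gauss_pdf(x, mu, var):
    """Gaussian density with diagonal covariance (var holds the diagonal)."""
    return np.exp(-0.5 * np.sum((x - mu) ** 2 / var)) / \
           np.sqrt((2 * np.pi) ** len(x) * np.prod(var))

def mixture_prob(x, lam):
    """p(x | lambda_j) as in Equation (2); lam is a list of (p, mu, var)."""
    return sum(p * diag_gauss_pdf(x, mu, var) for p, mu, var in lam)

# Illustrative two-component mixture in F = 2 dimensions
lam = [(0.6, np.zeros(2), np.ones(2)), (0.4, np.ones(2), np.ones(2))]
px = mixture_prob(np.zeros(2), lam)
```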

Training a set of clusters involves choosing clusters which best model the training data while having a low enough order to avoid overfitting to the training set at the expense of good generalization. The maximum likelihood (ML) criterion for the best cluster model for a given set of observed points x ∈ X is

λ_max = argmax_{λ ∈ {λ_1, …, λ_{C_i}}} P(X | λ) = argmax_{λ ∈ {λ_1, …, λ_{C_i}}} ∏_{t=1}^{T} p(x_t | λ).  (3)

This cannot be calculated analytically, so frequently the expectation maximization (EM) procedure is used [Figueiredo and Jain 2002; Reynolds and Rose 1995]. This iteratively converges to a locally optimal clustering. EM starts with an initial solution, then iteratively improves the solution with the following steps.

(1) Expectation. For each point and mixture component, calculate the probability that the point was generated by the component’s distribution.

(2) Maximization. Update the model parameters to maximize the likelihood function using the membership probabilities calculated in the preceding expectation step.

The initial model and the choice of the number of mixtures (M) affect the final quality of the clusters. A common method of choosing M is to use EM to train several models using different values of M and starting distributions. The models can then be compared using some measure, and the best is selected. We use the Bayesian information criterion to compare models [Fraley and Raftery 1998]. The clustering is performed independently on each node, and each node may have a different number of clusters.
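The two EM steps above can be sketched for a diagonal-covariance GMM as follows. This is a minimal numpy illustration, not the paper's implementation; the farthest-point initialization, iteration count, and function name are our own assumptions.

```python
import numpy as np

def em_diag_gmm(X, M, n_iter=50, seed=0, eps=1e-6):
    """Fit an M-component diagonal-covariance GMM to X (T x F) with EM."""
    rng = np.random.default_rng(seed)
    T, F = X.shape
    # Initial solution: farthest-point choice of means, shared global variance.
    idx = [int(rng.integers(T))]
    while len(idx) < M:
        d = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    mu = X[idx].astype(float)
    var = np.tile(X.var(axis=0) + eps, (M, 1))
    p = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        # Expectation: responsibility r[t, i] proportional to
        # p_i * N(x_t | mu_i, diag(var_i)), computed in the log domain.
        log_dens = -0.5 * (np.log(2 * np.pi * var).sum(axis=1)
                           + ((X[:, None, :] - mu) ** 2 / var).sum(axis=2))
        log_w = np.log(p) + log_dens
        log_w -= log_w.max(axis=1, keepdims=True)
        r = np.exp(log_w)
        r /= r.sum(axis=1, keepdims=True)
        # Maximization: re-estimate weights, means, and diagonal variances.
        Nk = r.sum(axis=0) + eps
        p = Nk / T
        mu = (r.T @ X) / Nk[:, None]
        var = (r.T @ X ** 2) / Nk[:, None] - mu ** 2 + eps
    return p, mu, var
```

In practice this fit would be repeated for several values of M and several initializations, with BIC used to pick the winner, as described above.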

4.2.1. Assigning New Observations to Clusters. Transcripts for an observed action are generated by assigning each sample to a single cluster. These cluster assignments are used to label each sample, resulting in a string from a finite alphabet representing the motion, as observed from a single sensor node. Using Equation (4), a feature vector for a given sample is assigned the cluster most likely to have generated it.

    c(x) = \arg\max_{i=1,\ldots,M} \; p_i \, p(x \mid \mu_i, \Sigma_i).    (4)

ACM Transactions on Embedded Computing Systems, Vol. 11, No. S2, Article 46, Publication date: August 2012.


4.2.2. Implementation on Sensor Nodes. Cluster assignment in a deployed system will run on a sensor node, so efficiency is very important. One way to achieve this is to use log probabilities to assign the clusters. Specifically, extending Equation (4) yields the following relationship.

    c(x) = \arg\max_{i=1,\ldots,M} \; p_i \, p(x \mid \mu_i, \Sigma_i)    (5)

         = \arg\max_{i=1,\ldots,M} \; \left[ \log p_i + \log p(x \mid \mu_i, \Sigma_i) \right]    (6)

         = \arg\max_{i=1,\ldots,M} \; \left[ \log p_i + \log a_i - \frac{1}{2} \sum_{k=1}^{F} \frac{(x_k - \mu_{ik})^2}{\sigma_{ik}^2} \right].    (7)

Equation (6) follows from the logarithmic function being monotonic and strictly increasing. The expansion of the log Gaussian probability in Equation (7) as a summation is a consequence of using a diagonal covariance matrix. The variable a_i = 1 / ((2\pi)^{F/2} |\Sigma_i|^{1/2}) represents the normalization factor of the distribution. Furthermore, given the lack of a floating point unit on most sensor node hardware, calculations should be done with fixed point arithmetic.
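The terms of Equation (7) that do not depend on x (log p_i plus the log normalization factor) can be precomputed once per cluster and stored on the node, leaving only the weighted squared differences per sample. A floating-point sketch of this scheme follows; function names are illustrative, and a fixed-point port would pre-scale these quantities to integers.

```python
import numpy as np

def precompute_log_consts(p, mu, var):
    """Fold all x-independent terms of Equation (7) into one constant per
    cluster: log p_i + log a_i, with a_i the Gaussian normalization factor."""
    F = mu.shape[1]
    return np.log(p) - 0.5 * (F * np.log(2 * np.pi) + np.log(var).sum(axis=1))

def assign_cluster(x, mu, var, log_consts):
    """Label one sample with its most likely cluster (Equations (5)-(7))."""
    score = log_consts - 0.5 * (((x - mu) ** 2) / var).sum(axis=1)
    return int(score.argmax())
```

Per sample, the node then performs only F multiply-accumulate operations per cluster plus one comparison pass, which suits fixed-point microcontroller arithmetic.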

4.3. Hidden Markov Model

After each sensor node assigns a character to each sample, the actions can be determined on the base station using the HMM shown in Figure 3(b).

4.3.1. Training the Model. An HMM has M states and is defined by the model λ, consisting of three sets of probabilities.

    \lambda = \{ \pi_i, a_{ij}, b_j(k) \}.    (8)

The probability that a sequence begins with state s_i is π_i. The transition probability a_{ij} is the probability that the model transitions to state s_j when starting from state s_i. For discrete observations, b_j(k) gives the probability that observation v_k is emitted at state s_j.

For our system, the left-right model for each action and posture is trained independently; then all actions and postures are joined to form a single HMM. The model for each action is trained by starting with an initial model and iteratively improving it using the Baum-Welch procedure [Rabiner and Juang 1986]. The Baum-Welch procedure finds a local maximum of the likelihood. By trying a number of variations and selecting the best model according to some measure, the likelihood of finding the global maximum is increased. As with GMMs, a common technique for model selection is BIC [Biem 2003; Stoica and Selen 2004]. We try M_w ∈ {1, 2, ..., 10}. For each of these fixed-state models, we start by dividing the samples in each sequence evenly among states, then iterate the EM algorithm five times to converge the model. For the next nine tries for the current M_w, each state is initially assigned a random number of samples. The best of these 100 models is used to represent the action.

Each component of the HMM model in Equation (8) must be trained. Since the action must start at the first state, s_1,

    \pi_i = \begin{cases} 1 & \text{if } i = 1, \\ 0 & \text{otherwise.} \end{cases}    (9)

a_{ij} and b_j(k) are trained using the Baum-Welch algorithm, as described in Rabiner and Juang [1986]. The Baum-Welch algorithm is another EM algorithm. At each step, the state sequence probabilities are computed and used to update a_{ij} and b_j(k) based on the expected transitions and observations corresponding to each state, as seen in Equations (10) and (11).

    a_{ij} = \frac{E[\text{number of transitions from } s_i \text{ to } s_j]}{E[\text{number of transitions from } s_i]}.    (10)

    b_j^{(f)}(k) = \frac{E[\text{number of times in } s_j \text{ and observed symbol } v_k]}{E[\text{number of times in } s_j]}.    (11)
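Baum-Welch obtains the expected counts in Equations (10) and (11) from the forward-backward procedure. Replacing the expectations with hard counts over decoded state sequences gives the simpler Viterbi-style re-estimation sketched below; this is our simplification for illustration, not the paper's exact procedure, and the smoothing constant is an assumption.

```python
import numpy as np

def reestimate(state_seqs, obs_seqs, M, V, eps=1e-3):
    """Count-based analogue of Equations (10) and (11): expected counts are
    replaced by counts over hard state assignments; eps smooths zero counts."""
    A = np.full((M, M), eps)   # transition counts
    B = np.full((M, V), eps)   # emission counts
    for states, obs in zip(state_seqs, obs_seqs):
        for t in range(len(states) - 1):
            A[states[t], states[t + 1]] += 1
        for s, o in zip(states, obs):
            B[s, o] += 1
    A /= A.sum(axis=1, keepdims=True)   # rows of a_ij sum to one
    B /= B.sum(axis=1, keepdims=True)   # rows of b_j(k) sum to one
    return A, B
```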

The observation probabilities for each sensor node f are considered independent of the other nodes. This means that b_j^{(f)}(k) is computed for each sensor node separately, and the overall observation probability is

    b_j(k) = \sqrt[M_w]{\prod_{f \in F} b_j^{(f)}(k)}.    (12)

In speech recognition, a scaling factor called the language model scaling factor (LMSF) is used to compensate for an incorrect assumption of independence between observations [Wessel et al. 1998]. We adapt this concept by taking the M_w-th root of the observation probabilities.
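The combination in Equation (12) is a one-liner if done in the log domain, which also avoids underflow when many nodes are multiplied together (an illustrative sketch; the function name is ours):

```python
import numpy as np

def combined_emission(per_node_probs, Mw):
    """Equation (12): combine per-node observation probabilities and damp the
    product by the Mw-th root (the LMSF-style scaling described above)."""
    logs = np.log(np.asarray(per_node_probs, dtype=float))
    return float(np.exp(logs.sum() / Mw))   # log domain avoids underflow
```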

Right after the action finishes, it is expected to transition immediately to the following posture. Therefore, during training, only state sequences ending in the final state, s_{M_w}, should be considered. This can be accomplished simply if the observation probability takes the sample number into account and assigns probability 0 whenever the final observation of a given sequence is emitted from any state other than the final state.

    b'_j(k, t) = \begin{cases} 0 & \text{if } j \neq M_w \text{ and } t = T, \\ b_j(k) & \text{otherwise.} \end{cases}    (13)

4.3.2. Bayesian Information Criterion. Selecting the best among several models is a common problem in statistical pattern recognition. In general, a model with a greater number of parameters will better fit any set of training data but runs the risk of fitting eccentricities in the training data not present in the test data. The Bayesian information criterion [Biem 2003] is a method that only requires training data and has strong probabilistic properties.

    \mathrm{BIC}(\lambda) = \log p(X \mid \lambda, \hat{\theta}) - \frac{\alpha K}{2} \log T.    (14)

For HMMs, p(X | λ, θ̂) can be computed in a straightforward manner, as given in Rabiner and Juang [1986]. T is the total number of samples in the training set, and α is a regularizing term that allows for control over the overfitting penalty. We chose a value of α = 0.05, with which each action was represented by an average of three states. K, the number of free parameters, is

    K = 2(M - 1) + M \sum_{i=1}^{n_{\text{nodes}}} (c_i - 1).    (15)

M is the total number of states, and c_i is the number of clusters on the given sensor node. The first term represents the number of transition probabilities, and the second the number of emission probabilities.
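Equations (14) and (15) combine into a small scoring function. This is a sketch under the parameter count just described; the log likelihood itself would come from the forward algorithm, and the function name is ours.

```python
import numpy as np

def bic_score(log_likelihood, M, clusters_per_node, T, alpha=0.05):
    """Equations (14) and (15): penalized log likelihood for model selection.
    clusters_per_node lists c_i for each sensor node."""
    K = 2 * (M - 1) + M * sum(c - 1 for c in clusters_per_node)
    return log_likelihood - alpha * K / 2.0 * np.log(T)
```

Among candidate models, the one with the highest score is kept; larger M raises K and therefore the penalty.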

4.4. Joining Models

The models for each action are joined into a single HMM. Posture self-transition probabilities are derived from the training data, while the transitions from a posture to each of its associated actions are assigned equal probability.

4.5. Sequence Extraction for Classification

After training, the model is deployed on a BSN. The clustering operates on individual nodes, and the cluster sequences are transmitted to the base station, where the HMM is used to segment and classify the actions. For this, the most likely state sequence is extracted using the Viterbi algorithm [Rabiner and Juang 1986]. The states in the sequence are grouped by action; the individual state progression within an action is considered unimportant. The model dictates that occurrences of an action are separated by at least one posture state. This means that even if an action repeats many times, the number of repetitions can be easily counted.
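The decoding and grouping steps can be sketched as follows. This is a generic textbook log-domain Viterbi, not the paper's simplified left-right variant, and the state-to-action mapping is an assumed input.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for a discrete-emission HMM (log domain)."""
    M, T = A.shape[0], len(obs)
    logA, logB = np.log(A + 1e-300), np.log(B + 1e-300)
    delta = np.log(pi + 1e-300) + logB[:, obs[0]]
    back = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA          # scores[i, j]: end in j via i
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def segments(path, state_to_action):
    """Group consecutive states mapping to the same action into segments."""
    out = []
    for s in path:
        a = state_to_action[s]
        if not out or out[-1][0] != a:
            out.append([a, 1])
        else:
            out[-1][1] += 1
    return [(a, n) for a, n in out]
```

Counting repetitions of an action then reduces to counting its segments, since the joined model forces a posture state between consecutive occurrences.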

4.6. Runtime Order and Comparison to Other Methods

At runtime, there are two primary stages: transcript generation and HMM-based recognition. For transcripts, the probability that the feature vector at each sample was generated by each cluster is calculated. The sample receives the label of the highest-probability cluster. Because a diagonal covariance matrix is used, calculating probabilities is a linear function of the number of features.

    O(\text{Clustering}) = O(T \cdot F \cdot C_i),    (16)

where T is the number of samples considered, F is the number of features in the feature vector, and C_i is the number of clusters on sensor node i.

The HMM uses the Viterbi algorithm [Rabiner and Juang 1986] for classification and segmentation. Since the HMM consists of several joined left-right models, the simplified Viterbi runs more efficiently.

    O(\text{Classifier}) = O(T \cdot 2 M_a \cdot K).    (17)

In Equation (17), the total number of action states is M_a, and K is the number of sensor nodes.

This method must be compared to other methods that both segment and classify data; it cannot be compared with methods that rely on external data to segment the data. The method in Lv and Nevatia [2006] is O(T^3) and so has a lower runtime efficiency; the approach is not designed for sensor networks and thus implies a large number of transmissions. Yang et al. [2009] propose a system that can either be used on fixed segments or adaptively choose a segmentation. Both are linear with respect to the number of samples; however, there are large constant factors, such as the number of length hypotheses and the repeated computation of a matrix inverse for each hypothesis.

Finally, there are a number of methods based on fixed segmentation. These also can do no better than linear time. However, the associated constant factors may be smaller than for our model. These methods do not take temporal characteristics into account, so actions that use the same motions for a portion of time will be indistinguishable, even if the sequence is unique to each action.

Table II. Results for All Subjects Trained on One Model

Subject   k-NN   Full Samples Accuracy   Clustering Accuracy
1         96.5   78.1%                   92.6%
2         93.2   38.1%                   82.6%
3         99.4   10.9%                   83.2%

5. RESULTS

For our experiment, three subjects performed the actions listed in Table I using the body sensor configuration described in Section 3. The system was trained using approximately half the data. An action is considered properly labeled if over 50% of the action was labeled as the action and the rest was labeled as either the start or the end posture. If any part of an action was labeled as a different action, then the action is considered incorrectly labeled.

5.1. Comparison to Other Methods

The advantage of the classification technique developed in this article is the ability to both segment and classify data. In this section, the accuracy of our method is explored. However, it is useful to compare it with other techniques. First, we look at the accuracy when using k-NN classification with manually performed segmentation, as proposed in Ghasemzadeh et al. [2008]. The implementation uses data fusion to make a decision based on a full feature vector containing data from each node instead of the decision fusion technique outlined. Because the k-NN test uses manual segmentation and features extracted from all the data instead of from the transcripts, the k-NN results represent a conceptual upper bound on the accuracy.

The second method we choose for comparison is the HMM outlined in this article, but using features linearly quantized into nine levels. This has the potential for higher accuracy than clustering, since clustering potentially discards useful information. In our results, this method produced considerably higher error, probably due to a combination of overfitting, over-simple quantization, and the lack of a feature-selection technique. The need for feature selection is especially likely, as there are only five training trials for each subject and over 20 features, some of which may be fairly useless or highly correlated with other features.

5.2. Classification Accuracy

Table II shows the result for each subject when a single model was trained on all subjects. With clustering, accuracy for each subject is reasonable, especially given the similarity of the movements. As expected, k-NN on an ideal (human-generated) segmentation outperforms the HMM. The lower accuracy of the HMM is compensated by significantly less required data transmission, as well as the automatic segmentation provided by the HMM.

Visual inspection of the transcripts shows consistency within subjects but marked differences between subjects for the same movements. This is expected because of differences in the way subjects perform movements; that is, everybody sits down on a bed slightly differently. Subject 1 exhibited much higher intertrial consistency than the other two subjects, leading to Subject 1 dominating the trained model. This is the reason for Subject 1 having 10% higher accuracy than the other two subjects. By independently training the model on a per-subject basis, the bias towards a particularly consistent subject can be eliminated.

Table III. Results for All Subjects Trained Individually

Subject   k-NN   Full Samples Accuracy   Clustering Accuracy
1         97.5   95.1%                   90%
2         91.5   48.9%                   94%
3         99.4   47.4%                   94%

The results of this approach are shown in Table III. The improvements for Subjects 2 and 3 are considerable, while the accuracy for Subject 1 actually dropped. A clue to this drop can be found in the training accuracy: Subject 1 has 100% accuracy over the training set. This is a symptom of overfitting. The classic solution to overfitting is increasing the number and variation of training trials. Subject 1's consistency is the primary reason that this subject ran into an overfitting problem sooner than the others. The biggest disadvantage of this approach compared with training one model for all subjects is the requirement of more training data for each subject.

Once again, clustering outperforms full use of all samples but generally fails to beat k-NN, although the HMM with clustering produced the best results of any approach for Subject 2.

Detailed results for Subject 1, as displayed in Table III, are shown in Table IV. Of special interest is the confusion column. Some of the misclassifications are expected. For instance, picking an object off the ground and off a coffee table are very similar, so the confusion of Actions 10 and 8 makes sense. Similarly, turning counterclockwise 90° is quite similar to returning from a clockwise turn. However, the confusion between reaching up to a cabinet with the left hand and moving forward one step makes little sense and so represents a true error.

A visual representation of the segmentation and classification process for Subject 1 is shown in Figure 4(a), and for the same movement with Subject 3 in Figure 4(b). The clusters are on the bottom. The labels in red are the canonical annotations (ground truth), while the ones above in blue are generated annotations (system output). The grayscale bar at the top represents the progression of states. Movements 11a and 11b are turn counterclockwise 90° and return, respectively. Movements 12a and 12b are turn clockwise 90° and return, respectively. The clusters from the left thigh show a very consistent pattern, while the clusters from the waist and right arm show significant variation. The HMM is able to accurately identify these actions from among all possible actions, as can be seen from the labeling at the top. Sometimes Subject 1's clockwise turn is misidentified as the return from a counterclockwise turn, which is not necessarily even a mistake: the two may be impossible to distinguish even for a person. The transcripts for each subject are markedly different, even though both come from the same action. These figures give clear motivation to prepare separate models for each subject.

5.3. Bandwidth Savings from Clustering

The primary reason for choosing clustering instead of transmitting samples directly was decreasing transmissions. The savings are shown in Table V. In the first column, the uncompressed, 12-bit-per-sensor data is transmitted, with results shown in bytes per second. For the next column, sensor data is first quantized (with nine possible bins per sensor) and then represented by two pieces of information: the quantized label and the duration in samples of that value. The results shown are the average entropy per original sample. Coding methods, such as Huffman's, come close to achieving entropy, so this is a reasonable estimate of bandwidth. The final column is similar, except that clustering is performed instead of quantization.
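The label-plus-duration representation and its entropy estimate can be reproduced as follows. This sketch treats the symbol and duration streams as independent sources, an assumption consistent with the per-sample averaging described above; function names are ours.

```python
import numpy as np
from collections import Counter

def run_length(labels):
    """Collapse a label stream into (symbol, duration) pairs."""
    runs = []
    for s in labels:
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + 1)
        else:
            runs.append((s, 1))
    return runs

def entropy_bits(seq):
    """Empirical entropy of a discrete sequence, in bits per element."""
    counts = np.array(list(Counter(seq).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

def bits_per_sample(labels):
    """Entropy of the (label, duration) representation per original sample."""
    runs = run_length(labels)
    bits_per_run = entropy_bits([s for s, _ in runs]) + entropy_bits([d for _, d in runs])
    return bits_per_run * len(runs) / len(labels)
```

A long run of a single cluster label costs almost nothing under this scheme, which is why slow-changing transcripts compress so well.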


Table IV. Results for an Independently Trained Subject 1

Action   # States   Accuracy   # Actions   Confusion
1        6          100%       12
2        5          100%       12
3        4          100%       11
4        4          100%       11
5        3          100%       10
6        5          100%       10
7        4          100%       5
8        4          100%       5
9        3          100%       5
10       3          80%        5           8:1
11a      4          60%        5           12b:2
11b      3          100%       5
12a      3          0%         5           11b:5
12b      3          100%       5
13       6          100%       5
14       7          67%        6           19a:1, 19b:1
15a      6          100%       5
15b      3          100%       5
16a      4          100%       5
16b      4          100%       5
17a      3          100%       6
17b      3          60%        5           18a:2
18a      1          80%        5           17b:1
18b      3          100%       5
19a      2          80%        5           11b:1
19b      3          100%       5
20a      2          60%        5           18a:1, 21a:1
20b      2          20%        5           10:1, 17a:1, 21b:2
21a      2          100%       5
21b      3          100%       5
22       10         100%       6
24       4          100%       5
25       4          100%       5
Total               90%

The savings are most dramatic when compression of any kind is applied; however, clustering still reduces the bandwidth by about 75% relative to compressed quantized samples.

5.4. Rejection Criterion

One application of this work is life logging, in which a participant wears sensors for several days and the system automatically creates a diary of the actions performed. Most classification systems, including our hidden Markov model, try to classify an action as one of several possible actions. For life logging, many actions are novel and should not be labeled as one of the training actions. This "none of the above" labeling is also called a rejection criterion. There are several rejection strategies. The most basic is to compute a confidence for each action and, if the measure is under a threshold, reject the action as not belonging to any known class [Kashi et al. 1998]. Another strategy is to train one or more rejection classes on a variety of data that is to be rejected. The prior probability of this class can be changed to raise or lower the rejection threshold [Rosenberg et al. 1998]. In our work, we investigate threshold-based methods.

As the model in this article already extracts the most likely sequence using the Viterbi algorithm, a natural confidence measure is the log-likelihood probability of a given action in the most likely sequence. The likelihood of the most likely state sequence decreases as the length of the sequence increases; therefore, it should be normalized by the path length [Kashi et al. 1998].

Fig. 4. Classification results for the action turn clockwise 90°.

Table V. Data Savings from Clustering

Subject   Uncompressed (B/s)   Samples Cmp. (B/s)   Clustering Cmp. (B/s)
1         165.00               10.91                2.78
2         165.00               11.93                2.97
3         165.00               13.44                3.21

As will be seen in the results, this rejection criterion with a fixed threshold led to many false rejections and false recognitions. Looking at the state progression, as seen in the state progression bar in Figure 4(a), correctly recognized actions tended to spend approximately equal time in each state within the action, while incorrectly recognized actions tended to linger in one state for most of the time, then go through a rapid set of transitions to get to the next action. A simple way to quantify this is to compare the entropy of the observed chain to the theoretical entropy of the model. Entropy is maximized if all states have equal representation. The maximum entropy is H_max(x) = log|X| for x ∈ X. We set a threshold for negative entropy. The action is accepted if

    -k \cdot \log|X| \ge -H(S_i),    (18)

where H(S_i) is the entropy of the state sequence extracted by the Viterbi algorithm. To test several possibilities, movements 13-18b in Table I were removed from the training set to provide a set of actions to be rejected. The lowest possible entropy is 0, for a single state, and the maximum entropy grows with the total number of states. To facilitate this measure, the models were all constrained to have five states per action. The results are shown in Figure 5. In this diagram, the top portion is identical to the clustering diagrams. The bottom shows various rejection criteria. The red lines are the likelihood of the current state and transition in the chain. The solid black lines are the average likelihood probabilities for the entire action. The solid green line is the entropy rejection threshold, and the dashed blue line is the observed entropy. The entropy threshold is 80% of the expected entropy based on the trained model. For Figures 5(b) and 5(c), the actions were not part of the training set. For this reason, the data theoretically all represents non-actions (except for the postures); therefore, no ground truth data is shown. In these figures, we expect the log likelihood and entropy measures to go down.
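The entropy test of Equation (18) reduces to comparing the empirical entropy of the decoded state sequence against a fraction k of the maximum log|X|. A sketch follows; the choice k = 0.8 mirrors the 80% threshold above, and the function name is ours.

```python
import numpy as np
from collections import Counter

def accept_action(state_seq, n_model_states, k=0.8):
    """Equation (18): accept an action when the empirical entropy of its
    decoded state sequence reaches a fraction k of the maximum log|X|."""
    counts = Counter(state_seq)
    probs = np.array([counts[s] for s in range(n_model_states)], dtype=float)
    probs = probs[probs > 0] / len(state_seq)
    H = float(-(probs * np.log(probs)).sum())
    return bool(H >= k * np.log(n_model_states))
```

A sequence that spreads its time evenly across the model's states passes, while a sequence that lingers in one state and then races through the rest is rejected.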


Fig. 5. Rejection thresholds.

In Figure 5(a), the negative entropy is always below the threshold, so all relevant actions are correctly identified. The actions in Figures 5(b) and 5(c) belong to the rejection set. Some of these actions will be rejected by the entropy criterion, but others will not. Looking at the average log likelihood combined with the entropy measure, it appears that sufficient evidence exists to reject them. Movement 25 is shown in Figure 5(d) to show the thresholds for a correctly identified movement, in contrast to the incorrect identifications for rejection-class Movement 13.

5.5. Hardware Implementation

We recently implemented transcript generation on a set of sensor nodes based on the hardware described in Section 3.1. The transcripts produced were consistent with those produced in MATLAB.

6. CONCLUSION AND FUTURE WORK

In this article, we presented an action recognition framework based on an HMM that is capable of both segmenting and classifying continuous movements. It is specifically designed for the distributed architecture of body sensor networks and has modest runtime requirements, which is essential for resource-limited sensor nodes. The accuracy is consistent with results reported from similar experiments described in the literature, but below that of a k-NN system using manual segmentation. We also examined the possibility of using thresholds based on log-probability and entropy to reject unknown movements. The methods are promising, but further work is needed to perfect them.

For deployment, several additional steps must be taken. First, in a system designed to monitor a subject throughout the day, many actions performed by the subject will not represent any of the trained actions. The system will need to not only recognize known actions but also reject unknown actions. Yoon et al. [2002] suggest a method based on rejection thresholds that could be used. Second, this system needs to be fully implemented on a BSN. Since our MATLAB tests proved successful, this is our next major step.

REFERENCES

AKYILDIZ, I., SU, W., SANKARASUBRAMANIAM, Y., AND CAYIRCI, E. 2002. Wireless sensor networks: A survey. Comput. Netw. 38, 4, 393-422.

AMINIAN, K., NAJAFI, B., BULA, C., LEYVRAZ, P., AND ROBERT, P. 2002. Spatio-temporal parameters of gait measured by an ambulatory system using miniature gyroscopes. J. Biomech. 35, 5, 689-699.

BAO, L. 2003. Physical activity recognition from acceleration data under semi-naturalistic conditions. Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA.

BAO, L. AND INTILLE, S. 2004. Activity recognition from user-annotated acceleration data. In Proceedings of the 2nd International Conference on Pervasive Computing. Lecture Notes in Computer Science, vol. 3001, Springer, Berlin, 1-17.

BIEM, A. 2003. A model selection criterion for classification: Application to HMM topology optimization. In Proceedings of the 7th International Conference on Document Analysis and Recognition. 104-108.

CASTELLI, G., ROSI, A., MAMEI, M., AND ZAMBONELLI, F. 2007. A simple model and infrastructure for context-aware browsing of the world. In Proceedings of the IEEE International Conference on Pervasive Computing and Communications. 229-238.

CHOUDHURY, T., BORRIELLO, G., CONSOLVO, S., HAEHNEL, D., HARRISON, B., HEMINGWAY, B., HIGHTOWER, J., KLASNJA, P., KOSCHER, K., LAMARCA, A., LANDAY, J. A., AND LESTER, J. 2008. The mobile sensing platform: An embedded system for capturing and recognizing human activities. IEEE Pervasive Mag. Special Issue on Activity-Based Computing.

COURSES, E., SURVEYS, T., AND VIEW, T. 2008. Analysis of low resolution accelerometer data for continuous human activity recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'08). 3337-3340.

FIGUEIREDO, M. AND JAIN, A. 2002. Unsupervised learning of finite mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 24, 3, 381-396.

FRALEY, C. AND RAFTERY, A. 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J. 41, 8, 578-588.

GHASEMZADEH, H., GUENTERBERG, E., GILANI, K., AND JAFARI, R. 2008. Action coverage formulation for power optimization in body sensor networks. In Proceedings of the Asia and South Pacific Design Automation Conference (ASPDAC'08). 446-451.

GUENTERBERG, E., OSTADABBAS, S., GHASEMZADEH, H., AND JAFARI, R. 2009. An automatic segmentation technique in body sensor networks based on signal energy. In Proceedings of the 4th International Conference on Body Area Networks (BodyNets'09).

GUENTERBERG, E., YANG, A. Y., GHASEMZADEH, H., JAFARI, R., BAJCSY, R., AND SASTRY, S. S. 2009. A method for extracting temporal parameters based on hidden Markov models in body sensor networks with inertial sensors. IEEE Trans. Inform. Technol. Biomed. 13, 6, 1019-1030.

HARTIGAN, J. AND WONG, M. 1979. A K-means clustering algorithm. J. R. Stat. Soc. Ser. C Appl. Stat. 28, 100-108.

HAUSDORFF, J., CUDKOWICZ, M., FIRTION, R., WEI, J., AND GOLDBERGER, A. 1998. Gait variability and basal ganglia disorders: Stride-to-stride variations of gait cycle timing in Parkinson's disease and Huntington's disease. Mov. Disord. 13, 3, 428-437.

JOHNSON, S. 1967. Hierarchical clustering schemes. Psychometrika 32, 3, 241-254.

JURAFSKY, D., MARTIN, J., KEHLER, A., VANDER LINDEN, K., AND WARD, N. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. MIT Press, Cambridge, MA.

KASHI, R., HU, J., NELSON, W., AND TURIN, W. 1998. A hidden Markov model approach to online handwritten signature verification. Int. J. Doc. Anal. Recog. 1, 2, 102-109.

LEE, H. AND KIM, J. 1999. An HMM-based threshold model approach for gesture recognition. IEEE Trans. Pattern Anal. Mach. Intell. 21, 10, 961-973.

LESTER, J., CHOUDHURY, T., KERN, N., BORRIELLO, G., AND HANNAFORD, B. 2005. A hybrid discriminative/generative approach for modeling human activities. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'05).

LV, F. AND NEVATIA, R. 2006. Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. In Proceedings of the 9th European Conference on Computer Vision (ECCV'06). Lecture Notes in Computer Science, vol. 3954, Springer, Berlin, 359.

NAIT-CHARIF, H. AND MCKENNA, S. 2004. Activity summarisation and fall detection in a supportive home environment. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04).

OUCHI, K., SUZUKI, T., AND DOI, M. 2004. LifeMinder: A wearable healthcare support system with timely instruction based on the user's context. In Proceedings of the 8th IEEE International Workshop on Advanced Motion Control (AMC'04). 445-450.

POLASTRE, J., SZEWCZYK, R., AND CULLER, D. 2005. Telos: Enabling ultra-low power wireless research. In Proceedings of the 4th International Symposium on Information Processing in Sensor Networks.

QUWAIDER, M. AND BISWAS, S. 2008. Body posture identification using hidden Markov model with a wearable sensor network. In Proceedings of the ICST 3rd International Conference on Body Area Networks.

RABINER, L. AND JUANG, B. 1986. An introduction to hidden Markov models. IEEE ASSP Mag. 3, 1, 4-16.

RAUDYS, S. AND JAIN, A. 1991. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Trans. Pattern Anal. Mach. Intell. 13, 3, 252-264.

REYNOLDS, D. AND ROSE, R. 1995. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3, 1, 72-83.

ROSENBERG, A., SIOHAN, O., AND PARATHASARATHY, S. 1998. Speaker verification using minimum verification error training. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.

SHERIDAN, P., SOLOMONT, J., KOWALL, N., AND HAUSDORFF, J. 2003. Influence of executive function on locomotor function: Divided attention increases gait variability in Alzheimer's disease. J. Am. Geriatrics Soc. 51, 11, 1633-1637.

STOICA, P. AND SELEN, Y. 2004. Model-order selection: A review of information criterion rules. IEEE Signal Process. Mag. 21, 4, 36-47.

VAN LAERHOVEN, K. AND GELLERSEN, H. 2004. Spine versus Porcupine: A study in distributed wearable activity recognition. In Proceedings of the 8th IEEE International Symposium on Wearable Computers. 142-149.

WARD, J., LUKOWICZ, P., TROSTER, G., AND STARNER, T. 2006. Activity recognition of assembly tasks using body-worn microphones and accelerometers. IEEE Trans. Pattern Anal. Mach. Intell. 28, 10, 1553-1567.

WESSEL, F., MACHEREY, K., AND SCHLUTER, R. 1998. Using word probabilities as confidence measures. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing.

YANG, A., JAFARI, R., SASTRY, S., AND BAJCSY, R. 2009. Distributed recognition of human actions using wearable motion sensor networks. J. Ambient Intell. Smart Environ. 1, 1-5.

YOON, H., LEE, J., AND YANG, H. 2002. An online signature verification system using hidden Markov model in polar space. In Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition. 329-333.

Received October 2009; revised June 2010; accepted September 2010
