CHAPTER 6
HUMAN BEHAVIOR UNDERSTANDING MODEL
6.1 INTRODUCTION
Analyzing human behavior in video sequences has been an active field
of research for the past few years. Vital applications of this field include
the monitoring of behaviors in secure installations, video surveillance,
video retrieval and human computer interaction systems. The main objective is
to recognize and predict the behavior and to detect abnormalities. Many
researchers have contributed approaches that predict the behavior as a
post-processing task. In this work, it is proposed to analyze the behavior of
the human action while it is in progress. In common scenarios, such as
parking lots and supermarkets, the visual surveillance system should detect
abnormal behavior (as an indication of theft) and raise an alarm to alert the
visual analysts. Hence, the action patterns of the people should be analyzed
and the state of action should be detected as either ‘normal’ or ‘abnormal’
to understand the behavior.
The characterization of human behavior is equivalent to dealing
with a sequence of video frames that contains both the spatial and temporal
information (Cadamo et al 2010). The temporal information conveys more
details for human behavior understanding. Normally, human posture analysis
is the basic step to extract the temporal information. During human posture
analysis, various human behavior patterns are exhibited in the form of key
postures like ‘turnleft’, ‘guardkick’, ‘falldown’ etc.
This chapter presents a novel human behavior understanding model
that analyses the human movements and learns the human posture status
either as ‘normal’ or ‘abnormal’ from the video sequences using Probabilistic
Global Action Graph (PGAG). According to the posture analysis, the status of
the human behavior can be predicted as either ‘normal’ or ‘abnormal’ using
the proposed approach. The process flow of the human behavior
understanding model is shown in Figure 6.1.
Figure 6.1 Process flow of the human behavior understanding model
The proposed human behavior understanding model consists of two
phases, namely training and testing phases. During the training phase, the
following pipeline of processes is involved: (i) Foreground segmentation, (ii)
Feature Extraction, (iii) Vector Quantization, and (iv) Probabilistic Global
Action Graph (PGAG) construction.
(i) The pixel layer based approach is used as an initial
preprocessing step to segment the foreground (human
silhouette) from the action video.
(ii) TSOC and COC are identified as features, which are extracted
from each silhouette.
(iii) In vector quantization, the aim is to group similar postures
together and create a finite number of key postures
representing the code book. The 35-dimensional shape feature
vector of each key posture is symbolized as a one-dimensional
‘VQ symbol’ in the code book (a minimal sketch follows this list).
(iv) A semiautomatic state space approach based human behavior
understanding model is simulated using the Probabilistic
Global Action Graph (PGAG).
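As an illustration of the vector quantization step (iii), the following is
a minimal sketch of codebook construction using k-means clustering. The use
of scikit-learn, the function name build_codebook, and its parameters are
illustrative assumptions, not the exact implementation used in this work.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(features: np.ndarray, n_key_postures: int = 82) -> KMeans:
    """Cluster training posture features (n_frames x 35) into key postures."""
    # Codebook size of 82 follows Section 6.4; feature dimension is 35 (TSOC+COC).
    kmeans = KMeans(n_clusters=n_key_postures, n_init=10, random_state=0)
    kmeans.fit(features)
    return kmeans  # kmeans.cluster_centers_ holds the key postures

# Each training frame is then mapped to its one-dimensional VQ symbol:
# vq_symbols = build_codebook(train_features).predict(train_features)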
In order to evaluate the designed model, the test phase is
formulated, which has two major steps, namely
(i) For the input silhouette sequence, the likelihood of the key
posture is identified using a similarity measure (see the
sketch after this list).
(ii) The key posture is analyzed as either ‘normal’ or ‘abnormal’
via the PGAG, and an alarm is raised for abnormal behavior.
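The first test step can be illustrated with a short, hedged sketch: given
the feature vector of a test silhouette, find the closest key posture in
the code book and convert the distance into a similarity score. The mapping
of distance to a score in (0, 1] used here is an assumption for illustration.

import numpy as np

def closest_key_posture(feature, codebook):
    """feature: (35,) test vector; codebook: (m, 35) key posture matrix."""
    dists = np.linalg.norm(codebook - feature, axis=1)  # Euclidean distances
    j = int(np.argmin(dists))                           # index = VQ symbol
    similarity = 1.0 / (1.0 + dists[j])                 # distance -> (0, 1]
    return j, similarity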
In any real-time system, the behavior model is essential for
capturing domain knowledge irrespective of the action. The rest of the
chapter focuses mainly on designing a domain-specific behavior model based
on key posture transitions.
6.2 PROBABILISTIC GLOBAL ACTION GRAPH (PGAG)
In general, every action has a finite number of key postures, and
there exists a bounded relation across actions due to the restricted
movement of the human body. Hence, related action sequences may share a few
of the key postures. The relations can be denoted in two cases:
(i) Considering the ‘walk’ and ‘turnaround’ normal actions, a few
of the frequently correlated postures are ‘heel-strike’, ‘toe-off’,
‘mid-stance’ etc.
(ii) Similarly, for the ‘shotgun’ and ‘falldown’ abnormal actions,
‘raise hand’, ‘point’, ‘bend-knee’, ‘lower body’ and ‘fall on
floor’ are the correlated postures. Normally, a few actions
exhibit similar key postures either at the beginning or ending
point of their occurrence.
A weighted directed action graph, termed PGAG, is constructed,
which acquires and distinguishes the posture transitions across various
composite actions globally. In this graph, each node represents a key posture.
The weighted link between nodes represents the transitional probability
between the two key postures. The temporal characteristic of each action is
obtained using the posture transitions. Under the state level hypothesis, the
transitions among nodes signify the occurrence of an event. Events can be
defined based on dominant and persistent characteristics of the posture
transitions. Hence, the PGAG possesses the characteristics of understanding
the behavior in terms of posture transitions.
6.2.1 Construction of PGAG
The PGAG is constructed using a probabilistic posture transition
matrix. The steps involved in the construction of the PGAG are as follows:
1. The number of nodes in the PGAG equals the number of VQ
symbols in the posture code book (i.e. # of PGAG nodes = # of
key postures).
2. For each pair of key postures ‘i’ and ‘j’, the posture transition
probability (Pij) is obtained as

$P_{ij} = \dfrac{\#\ \text{of transitions from Posture}\ i\ \text{to Posture}\ j}{\#\ \text{of transitions from Posture}\ i}$   (6.1)
3. The posture transition probabilities out of each posture are
constrained by

$\sum_{j=1}^{m} P_{ij} = 1$   (6.2)

where the sum of the transition probabilities from the ith posture to all
jth postures must be equal to ‘1’, and ‘m’ is the total number of key
postures. Hence, the posture transition matrix has the dimension m x m.
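The construction of the posture transition matrix (Equations 6.1 and 6.2)
can be sketched as follows; the function and variable names are illustrative
assumptions, not the thesis code.

import numpy as np

def build_transition_matrix(sequences, m):
    """sequences: iterable of VQ-symbol lists; m: number of key postures."""
    counts = np.zeros((m, m))
    for seq in sequences:
        for i, j in zip(seq[:-1], seq[1:]):
            counts[i, j] += 1                 # count transition i -> j
    row_sums = counts.sum(axis=1, keepdims=True)
    # Equation (6.1): row-normalize the counts; rows then satisfy (6.2).
    return np.divide(counts, row_sums,
                     out=np.zeros_like(counts), where=row_sums > 0)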
The PGAG using six key postures of the ‘runstop’ action is depicted
in Figure 6.2. In this graph, the ‘runstop’ action is performed in multiple
views. The transition paths are normally cyclic in nature, and there exist
specific beginning and ending key postures as detailed in Figure 6.2.
Figure 6.2 PGAG with 6 nodes (P0-P5) and posture transition
probabilities for ‘runstop’ action
The corresponding posture transition matrix using the PGAG is listed
in Table 6.1. The posture transition matrix has mostly zero probabilities
below the main diagonal; the main reason is that the ‘runstop’ action has
temporally ordered key postures with largely left-to-right transitions.
Table 6.1 Posture transition matrix for ‘runstop’ action
Pij P0 P1 P2 P3 P4 P5
P0 0.109 0.165 0.226 0.125 0.245 0.130
P1 0.000 0.228 0.450 0.000 0.000 0.322
P2 0.000 0.130 0.200 0.370 0.000 0.300
P3 0.000 0.000 0.190 0.340 0.454 0.016
P4 0.000 0.000 0.000 0.520 0.480 0.000
P5 0.000 0.000 0.000 0.000 0.000 1.000
6.3 BEHAVIOR UNDERSTANDING MODEL
The constructed PGAG can be effectively used for analyzing the
frame level action dynamics in the form of key posture transitions. A human
behavior understanding model is simulated, which predicts the status of the
key postures either as ‘normal’ or ‘abnormal’ using a priori knowledge.
In the training phase, each VQ symbol has been assigned a unique
behavior status as either ‘normal’ or ‘abnormal’. The model notifies the
abnormal behavior in the sequence of events: it analyzes the current state
and the next state using the PGAG and the VQ symbol status (normal /
abnormal), and raises an alarm during an abnormal event.
The probabilistic state transitions are described in the form of four
cases which are depicted in Figures 6.3 (a) to 6.3(d).
Case 1: Initial state is normal
Figure 6.3(a) Case 1 of PGAG
Case 2: Current state is normal and the next state is most probably normal
Figure 6.3(b) Case 2 of PGAG
Case 3: Initial state is abnormal
Figure 6.3(c) Case 3 of PGAG
Case 4: Current state is abnormal and the next state is most probably
abnormal
Figure 6.3(d) Case 4 of PGAG
where, in Case 2 and Case 4, the likelihood term represents the maximum of
the posture transition probabilities from the current posture ‘i’ to any
posture ‘j’, i.e. the maximum value of the ith row in the posture
transition matrix.
In the testing phase, the behavior status of a video can be plotted
as the frame number versus the posture status indication (where
Normal = 0 and Abnormal = 1). Thus, the PGAG based human behavior
understanding model is capable of measuring the probabilistic likelihood of
the next state of the posture sequence and generating an appropriate alarm
for the concerned authorities in real-time.
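The four cases above can be condensed into a short, hedged sketch of the
alarm logic. The thesis states the cases in terms of probabilistic state
transitions; here the next state is approximated by the arg max of the
current row of the transition matrix, and an alarm is raised whenever the
current or predicted key posture carries the a priori ‘abnormal’ label. The
function name and this exact decision rule are illustrative assumptions.

import numpy as np

def behavior_status(current, P, is_abnormal):
    """current: VQ symbol of the current frame; P: (m, m) transition matrix;
    is_abnormal: boolean array with the a priori label of each key posture."""
    next_state = int(np.argmax(P[current]))   # most probable next posture
    if is_abnormal[current] or is_abnormal[next_state]:
        return 1                              # abnormal: raise an alarm
    return 0                                  # normal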
6.4 EXPERIMENTS
The proposed human behavior understanding model is
evaluated on the public video data set MuHAVi-MAS
(http://dipersec.king.ac.uk/MuHAVi-MAS/). The silhouette images are in
PNG format and each action combination can be downloaded as a small zip
file (between 1 and 3 MB). Also, the developers of MuHAVi-MAS have added
the 3 constant characters "GT-" to the beginning of every original image
name to label them as ground truth images. Here, 5 composite action classes,
namely ‘CA1-walkturnback’, ‘CA2-runstop’, ‘CA3-punch’, ‘CA4-kick’ and
‘CA5-shotguncollapse’, along with manually annotated action status are
available for the corresponding image frames. Also, it contains information
about actor, camera views and sample identity. Thus, the MuHAVi-MAS data
set has enough information to validate the performance of the proposals for
human behavior understanding model using PGAG. Sample frames from the
data set for five composite actions are shown in Figure 6.4.
(a) CA3 - Punch (b) CA4- Kick (c) CA5- ShotGunCollapse
(d) CA1 - WalkTurnBack (e) CA2 – RunStop
Figure 6.4 Sample image frames from MuHAVi-MAS dataset for 5
composite actions
This multi-view data set with five cameras contains the ground
truth, which is explicitly represented for each of the composite actions
performed by five actors. Also, these five composite actions have been
logically partitioned into 14 primitive actions as detailed in Table 6.2.
Table 6.2 Detailed specification about 14 primitive actions from
MuHAVi-MAS data set

Composite action (status)            Primitive action       Data set size (No. of samples x No. of frames)
CA1 WalkTurnBack (N - Normal)        C11 WalkRightToLeft    8 x 72  = 576
                                     C13 TurnBackRight      4 x 61  = 244
                                     C12 WalkLeftToRight    8 x 86  = 708
                                     C14 TurnBackLeft       4 x 54  = 216
CA2 Run_Stop (N - Normal)            C9  RunRightToLeft     8 x 66  = 528
                                     C13 TurnBackRight      4 x 52  = 208
                                     C10 RunLeftToRight     8 x 78  = 624
                                     C14 TurnBackLeft       4 x 51  = 204
CA3 Punch (AN - Abnormal)            C8  GuardToPunch       16 x 28 = 448
                                     C7  PunchRight         16 x 46 = 736
CA4 Kick (AN - Abnormal)             C6  GuardToKick        16 x 28 = 448
                                     C5  KickRight          16 x 47 = 752
CA5 ShotGunCollapse (AN - Abnormal)  C1  CollapseRight      8 x 84  = 672
                                     C3  StandupRight       8 x 120 = 960
                                     C2  CollapseLeft       8 x 93  = 744
                                     C4  StandupLeft        4 x 112 = 448
Based on the available ground truth, the data set with 140 video
samples contains 3308 normal frames and 5208 abnormal frames; as a
whole, 8516 frames are considered for the experimentation. The data set
detailed so far in Table 6.2 is uniformly partitioned into two data sets, namely
Train Set and Test Set. During this partitioning, the number of frames
considered per composite action and the corresponding ground truth for each
frame with status as either ‘normal’ or ‘abnormal’ is mentioned in detail in
Table 6.3.
Table 6.3 MuHAVi-MAS data set partitioning

           No. of frames per composite action
Data Set   CA1    CA2    CA3    CA4    CA5     Normal   Abnormal
Train set  872    782    592    600    1412    2248     2010
Test set   872    782    592    600    1412    2248     2010
In the training phase, the Train set is chosen to learn the behavior
by updating the PGAG. From the silhouettes of each action video, the TSOC
and COC features are extracted and then vector quantized into 82 key postures.
The recognized key postures are further subcategorized as 39 ‘normal’ and
43 ‘abnormal’ key postures. The detailed categorization of key postures is
listed in Table 6.4.
Table 6.4 Categorization of key postures per composite action
Composite
Action
No. of Key
Postures
VQ
symbols
No. of
normal
postures
No. of
abnormal
postures
WalkTurnBack 15 w1-w15 15 0
Kick 17 k1-k17 3 14
Punch 18 p1-p18 4 14
Runstop 12 r1-r12 12 0
ShotGunCollapse 20 s1-s20 5 15
The recognized 82 key postures represent the nodes of the PGAG,
and the 82 x 82 dimensional posture transition probability matrix is computed,
considering the similarity between the training postures and the key postures. The
‘runstop’ action has 12 key postures and their posture transition matrix is
listed in Table 6.5.
Table 6.5 Posture transition matrix for ‘runstop’ action using
MuHAVi-MAS data samples
Pij r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12
r1 0.351 0.149 0.020 0.007 0.041 0.027 0.027 0.074 0.074 0.047 0.115 0.068
r2 0.159 0.280 0.098 0.015 0.030 0.023 0.061 0.121 0.091 0.008 0.098 0.015
r3 0.041 0.107 0.339 0.132 0.008 0.116 0.017 0.190 0.000 0.000 0.041 0.008
r4 0.000 0.028 0.135 0.390 0.057 0.199 0.014 0.149 0.000 0.014 0.014 0.000
r5 0.050 0.042 0.034 0.050 0.210 0.084 0.000 0.143 0.025 0.185 0.134 0.042
r6 0.022 0.014 0.065 0.275 0.080 0.348 0.007 0.152 0.000 0.007 0.029 0.000
r7 0.064 0.085 0.213 0.043 0.043 0.021 0.277 0.106 0.064 0.000 0.085 0.000
r8 0.026 0.093 0.144 0.093 0.093 0.113 0.015 0.273 0.005 0.021 0.093 0.031
r9 0.126 0.165 0.000 0.000 0.000 0.010 0.019 0.010 0.505 0.000 0.029 0.136
r10 0.090 0.010 0.000 0.010 0.170 0.000 0.000 0.020 0.030 0.310 0.240 0.120
r11 0.099 0.033 0.007 0.013 0.139 0.033 0.033 0.159 0.040 0.106 0.265 0.073
r12 0.136 0.027 0.009 0.000 0.055 0.018 0.009 0.000 0.109 0.145 0.045 0.445
At any current state of the simulated model with the ith key posture,
the possible next state transition is evaluated using the similarity score,
which represents the most probable transitions of the ith key posture, i.e.
the maximum values of the ith row of the posture transition matrix; these
are highlighted in Table 6.5. Also, the few cells having the value ‘0.000’
imply that no transitions have occurred between the corresponding postures.
6.4.1 GUI Design for Human Behavior Analysis
The GUI design for human behavior analysis is implemented,
which includes the ‘Browse’ option and display provisions for frame number,
similarity score with the closest key posture, key posture label and current
action status. The ‘Browse’ option is used to select the test video sample for
verifying the model performance. The frame number indicates the current
silhouette being processed in the input video. The similarity score, in the
range [0, 1], provides the distance measure between the current frame and the
closest key posture across the code book. The posture label displays the VQ
symbol based on the action. The posture type indicates the behavior alarm as either ‘normal’ or
‘abnormal’. The simulated model performance for the given video is plotted
as the number of frames versus the status of the posture as either ‘normal’ or
‘abnormal’. The ‘normal’ or ‘abnormal’ status is scaled as 0 or 1 respectively.
In Figure 6.5, the summary of results obtained using the proposed
PGAG based behavior understanding model is illustrated for the test sample
of ‘kick’ abnormal action. This sample video sequence has 40 frames, out of
which 29 frames are categorized as ‘abnormal’ and 11 frames are accounted
as ‘normal’. According to the annotation provided by the data set, the result
achieved during behavior learning for this sample is 72.5%. Even though at
the action level ‘kick’ is categorized into ‘abnormal’ status, at the frame level
their starting and ending frame sequences exhibit ‘standing’ posture only,
which is a ‘normal’ one. Hence, the proposed model has attained correct
behavior understanding. Similarly, for the second test sample of
‘shotguncollapse’ abnormal action, the results are summarized in Figure 6.6.
This sample consists of 50 frames, of which 48 frames are categorized
as ‘abnormal’. Thus, the proposed behavior understanding model obtained
96% accuracy. Likewise, for the third test sample of ‘walkturnback’ normal
action as shown in Figure 6.7, the accuracy reported is 95%.
Figure 6.5 GUI based results for ‘kick’ action, where frame number 40
is alarmed as ‘ABNORMAL’ and the unique VQ symbol
from PGAG is 56. Overall performance plot shows out of 40
frames, 29 frames are identified as ‘ABNORMAL’ and
hence most probably ‘ABNORMAL’ status
Figure 6.6 GUI based Results for ‘shotguncollapse’ action, where frame
number 50 is alarmed as ‘ABNORMAL’ and the unique VQ
symbol from PGAG is 62. Overall performance plot shows out
of 50 frames, 48 frames are identified as ‘ABNORMAL’ and
hence most probably ‘ABNORMAL’ status
Figure 6.7 GUI based Results for ‘walkturnback’ action, where frame
number 40 is alarmed as ‘NORMAL’ and the unique VQ
symbol from PGAG is 23. Overall performance plot shows out
of 40 frames, 38 frames are identified as ‘NORMAL’ and
hence most probably ‘NORMAL’ status
6.4.2 Performance Analysis
The model is evaluated for predicting the ‘normal’ or ‘abnormal’
posture status using a test set of 56 video samples with 4258 frames. The test
outcome can be either ‘1’ i.e. predicting that the human has performed
‘abnormal’ action or ‘0’ i.e. predicting that the human has performed ‘normal’
action.
TP - true positives (abnormal, correctly declared as abnormal)
TN - true negatives (normal, correctly declared as normal)
FP - false positives (normal, incorrectly declared as abnormal)
FN - false negatives (abnormal, incorrectly declared as
normal)
The performance is measured based on the following metrics:
Accuracy – Proportion of true results in the result set.

$\text{Accuracy (Ac)} = \dfrac{\#TP + \#TN}{\#TP + \#TN + \#FP + \#FN}$   (6.3)

Precision – Proportion of true positives against all positive
results.

$\text{Precision (Pr)} = \dfrac{\#TP}{\#TP + \#FP}$   (6.4)

Sensitivity – Proportion of actual positives which are correctly
identified as such. This is also called the ‘recall rate’.

$\text{Sensitivity (Sen)} = \dfrac{\#TP}{\#TP + \#FN} \times 100$   (6.5)

Specificity – Proportion of negatives which are correctly
identified.

$\text{Specificity (Spe)} = \dfrac{\#TN}{\#TN + \#FP} \times 100$   (6.6)
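Equations (6.3) to (6.6) translate directly into code; the following sketch
uses illustrative names, with the Train-set counts of Table 6.6 as a worked
example.

def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)      # Equation (6.3)
    precision = tp / (tp + fp)                      # Equation (6.4)
    sensitivity = 100.0 * tp / (tp + fn)            # Equation (6.5), recall %
    specificity = 100.0 * tn / (tn + fp)            # Equation (6.6), %
    return accuracy, precision, sensitivity, specificity

# Train-set counts from Table 6.6:
# metrics(1691, 1931, 317, 319) -> (approx. 0.85, 0.84, 84.1, 85.9)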
The performance of the behavior model for the training and test
data set using the PGAG approach is reported in Table 6.6 and Table 6.7
respectively.
Table 6.6 Performance analysis of human behavior model on training set

           No. of frames
Data Set   N      AN      FP    FN    TP     TN     Ac    Pr    Sen (%)  Spe (%)
Train Set  2248   2010    317   319   1691   1931   0.85  0.84  84       86
In Table 6.6, the performance of the behavior model is reported for
the Train set with 4258 video frames. The proposed work has correctly
categorized the ‘normal’ status with 86% specificity and the ‘abnormal’
status with 84% sensitivity. Hence, out of 2010 ‘abnormal’ frames, 1691 are
correctly recognized; similarly, out of 2248 ‘normal’ frames, 1931 are
correctly recognized, yielding 85% accuracy.
Table 6.7 Performance analysis of human behavior model on test set

           No. of frames
Data Set   N      AN      FP    FN    TP     TN     Ac    Pr    Sen (%)  Spe (%)
Test Set   2248   2010    329   341   1669   1919   0.84  0.83  84       85
In Table 6.7, the performance of the behavior model for the Test set
with 4258 unknown video frames is considered. The proposed work has
correctly categorized the ‘normal’ status with 85% specificity and the
‘abnormal’ status with 84% sensitivity. Hence, out of 2010 ‘abnormal’
frames, 1669 are correctly recognized; similarly, out of 2248 ‘normal’
frames, 1919 are correctly recognized, yielding 84% accuracy.
For both the results reported in Table 6.6 and Table 6.7, the reason
for obtaining only around 84% is the imprecise ground truth information.
Based on the ground truth, the actions ‘kick’, ‘punch’ and ‘shotguncollapse’
have been marked as ‘abnormal’. But, as per human visual perception, even
though these are marked as ‘abnormal’ actions, the initial and end points,
amounting to 13% of the action frames, exhibit ‘normal’ postures only. Hence,
the reported result has lower accuracy in terms of sensitivity and specificity.
The probabilistic behavior model is well structured and
implemented with real-time data. The performance shows that the system is
highly reliable for behavior analysis.
6.5 SUMMARY
The main contribution of this chapter is the simulation of a human
behavior understanding model for a real-time environment. To meet this
objective, a state space approach is formulated. The PGAG has been
proposed to learn the action dynamics at the frame level. The ultimate
purpose of the system is to predict the behavior status either as ‘normal’ or
‘abnormal’. The human behavior model is designed and experimented using
a multi-view data set with a train set of 4258 frames and a test set of
4258 frames, 8516 frames in total. The system is evaluated with four metrics. The
performance results have achieved 86% specificity and 84% sensitivity for
train set. Similarly for the test set, the system achieved 85% specificity and
84% sensitivity. The simulated behavior understanding model can analyze
video contents and recognize human postures and status of the actions well in
advance. This proposed model can be effectively utilized to ease the real
world scenarios where behavior understanding is a complex task.
The forthcoming chapter presents concluding remarks and
summarizes the findings of this research work. Also, the future avenues for
further extension are highlighted.