Human Activity Recognition using Temporal Templates
and Unsupervised Feature-Learning Techniques
A project report submitted
to
MANIPAL UNIVERSITY
For Partial Fulfillment of the Requirement for the
Award of the Degree
of
Bachelor of Engineering
in
Information Technology
by
Arpit Jain
Reg. No. 090911345
Satakshi Rana
Reg. No. 090911281
Under the guidance of
Dr. Sanjay Singh
Associate Professor
Department of Information and Communication Technology
Manipal Institute of Technology
Manipal, India
May 2013
I dedicate my thesis to my parents for their support and
motivation.
Arpit Jain
For my parents and my sister, Shreyaa.
Satakshi Rana
DECLARATION
I hereby declare that this project work entitled Human Activity Recog-
nition using Temporal Templates and Unsupervised Feature-Learning
Techniques is original and has been carried out by us in the Department of In-
formation and Communication Technology of Manipal Institute of Technology,
Manipal, under the guidance of Dr. Sanjay Singh, Associate Professor,
Department of Information and Communication Technology, M. I. T., Mani-
pal. No part of this work has been submitted for the award of a degree or
diploma either to this University or to any other University.
Place: Manipal
Date: 24-05-2013
Arpit Jain
Satakshi Rana
CERTIFICATE
This is to certify that this project entitled Human Activity Recogni-
tion using Temporal Templates and Unsupervised Feature-Learning
Techniques is a bonafide project work done by Mr. Arpit Jain and Ms.
Satakshi Rana at Manipal Institute of Technology, Manipal, independently
under my guidance and supervision for the award of the Degree of Bachelor of
Engineering in Information Technology.
Dr. Sanjay Singh
Associate Professor
Department of I & CT
Manipal Institute of Technology
Manipal, India

Dr. Preetham Kumar
Head
Department of I & CT
Manipal Institute of Technology
Manipal, India
ACKNOWLEDGEMENTS
We hereby take the privilege to express our gratitude to all the people who were
directly or indirectly involved in the execution of this work, without whom
this project would not have been a success. We extend our deep gratitude
to the lab assistants and Vinay sir for their co-operation and help in
downloading the dataset for our project. We thank Dr. Sanjay Singh for his timely
suggestions & guidance and Dr. Preetham Kumar for being a constant source
of inspiration. Our hearty thanks to Rahul, our classmates and teachers who
have supported us in all possible ways. Our deepest gratitude to our family
and friends who have been supportive all the while.
ABSTRACT
Computer vision includes methods for acquiring, processing, analyzing, and
understanding images. Applications of this field include detecting events, con-
trolling processes, navigation, modelling objects or environments, automatic
inspection and many more. Activity recognition is one of the applications
of computer vision that aims to recognize the actions and goals of one or
more agents from a series of observations of the agents' actions and the en-
vironmental conditions. The goal of the project is to train an algorithm to
automate detection and recognition of human activities performed in the video
data. The project can be utilized in scenarios such as surveillance systems,
intelligent environment, sports play analysis and web based video retrieval.
In the first phase of the project, actions like bending, side galloping and
hand wave were recognized using a Temporal Template matching technique
called the Motion History Image (MHI) methodology. The MHI method is
extensively used to represent the history of temporal changes involved in the
execution of an activity. Here the intensity of the pixels of a scalar-valued
image is varied depending upon the motion history. MHI suffers from a
prominent drawback: when self-occluding or over-writing activities are
encountered, it fails to generate a vivid description of the course of the
activity. This happens because when repetitive activities are performed, the
current action overwrites or deletes the information of the previous action. In
the project, we successfully devised and implemented a methodology which
overcame this problem of conventional MHI. In our mechanism, we have utilized
the red, green and blue channels of three different 3-channel images to
represent human movement. In the methodology used, feature extraction from the
Motion History Image and Motion Energy Image was performed using the seven Hu
moments, and thereafter training and classification were performed using the k-NN
Algorithm.
The second phase of the project focusses on a recent development in ma-
chine learning, known as “Deep Learning” which works on the concept of
biologically inspired computer vision to achieve the task of unsupervised
feature-learning. Stacked convolutional Independent Subspace Analysis
is used for unsupervised feature extraction on the Hollywood2 dataset. The
features are then utilized to classify videos in the dataset into their appropriate
classes using a Support Vector Machine (SVM). The system is capable of recognizing
12 different activities performed by humans.
Contents

Acknowledgements
Abstract
List of Tables
List of Figures
Abbreviations
Notations

1 Introduction
  1.1 Literature Survey
  1.2 Report Organisation

2 Recognizing Human Activities using Temporal Templates
  2.1 Motion-Energy Image (MEI)
  2.2 Motion History Image (MHI)
  2.3 Feature Extraction
  2.4 Dataset
  2.5 Training and Classification
  2.6 Summary

3 3-Channel Motion History Images
  3.1 Failure of MHI in case of Repetitive Actions
  3.2 Improved MHI Methodology
  3.3 Proposed Algorithm & Results Obtained
  3.4 Summary

4 Unsupervised Feature Learning for Activity Recognition
  4.1 Related Work
  4.2 PCA Whitening
  4.3 Independent Subspace Analysis
  4.4 Stacked ISA for Video Domain
  4.5 Experimental Setup
    4.5.1 Dataset
    4.5.2 Framework for Classification
    4.5.3 Results
  4.6 Summary

5 Snapshots

6 Conclusion and Future Work

References

Project Detail

List of Tables

2.1 Action Dataset
4.1 Hollywood2 Action Dataset
4.2 Performance of System

List of Figures

1.1 Approaches for Activity Recognition
2.1 MEI for Hand Wave
2.2 Different cases of MHI formation for Hand Waving Activity
2.3 Feature Extraction from MHI and MEI
3.1 MHI Failure: In case of Repetitive Action
3.2 Initially when no activity is performed
3.3 Activity starts for the first time
3.4 Hands raised in the anti-clockwise direction
3.5 Actions performed in the clockwise direction
3.6 Hand raised upwards for the second time
3.7 Graph 1
3.8 Graph 2
4.1 ISA Network
4.2 Stacked ISA
4.3 Stacked ISA for video data
5.1 Graphical User Interface
5.2 Bending Activity
5.3 MHI and MEI for Bending Activity
5.4 Side Galloping Activity
5.5 MHI and MEI for Side Galloping Activity
5.6 Training Completion
5.7 Test Bending Activity
5.8 Test Side Galloping Activity
5.9 Actions in Hollywood2 Dataset
5.10 Compute Features
5.11 Classification Results
ABBREVIATIONS
MHI Motion History Image
MEI Motion Energy Image
ISA Independent Subspace Analysis
SVM Support Vector Machine
CNN Convolutional Neural Network
PCA Principal Component Analysis
NOTATIONS
I(x, y) : Pixel value at location (x, y)
I(x, y, t) : Pixel value of the tth frame at location (x, y)
M_{ij} : Raw image moment of order (i + j)
\eta_{ij} : Normalized central moment of order (i + j)
x_i : ith training sample
\lambda_i : ith eigenvalue
U : Matrix of left-singular vectors after singular value decomposition
Chapter 1
Introduction
Computer vision is a field that includes methods for acquiring, processing,
analyzing, and understanding images and, in general, high-dimensional data
from the real world in order to produce numerical or symbolic information,
e.g., in the form of decisions [1]. A theme in the development of this field
has been to duplicate the abilities of human vision by electronically perceiving
and understanding an image. Applications range from tasks such as industrial
machine vision systems which, say, inspect bottles speeding by on a production
line, to research into artificial intelligence and computers or robots that can
comprehend the world around them. Computer vision is concerned with the
theory behind artificial systems that extract information from images. The
image data can take many forms, such as video sequences, views from multiple
cameras, or multi-dimensional data from a medical scanner. The classical
problem in computer vision is that of determining whether or not the image
data contains some specific object, feature, or activity.
Activity recognition is one of the sub-fields of computer vision that
aims to recognize the actions and goals of one or more agents from a series
of observations on the agents’ actions and the environmental conditions [2].
Human activity recognition focusses on accurate detection of human activities
based on a pre-defined activity model. It can be exploited to great societal
benefits, especially in real life, human-centric applications such as eldercare
and healthcare [3].
1.1 Literature Survey
The various approaches for recognizing human activities are classified into Sin-
gle Layered Approaches and Hierarchical Approaches [4]. Single-layered ap-
proaches are further classified into two types depending on how they model hu-
man activities, i.e., space-time approaches and sequential approaches. Space-
time approaches view the input as a 3-D (XYT) volume, while sequential ap-
proaches view and interpret the input video as a sequence of observations.
Space time approaches are further divided into three categories based on the
features they use from the 3-D space-time volumes: space-time volumes, tra-
jectories and local interest point detectors. Sequential approaches are classified
depending on whether they use exemplar-based recognition methodologies or
model-based recognition methodologies. In single layered approaches, each
activity corresponds to a cluster containing image sequences for that activity.
These clusters are categorized into classes each having some definite prop-
erty, so when an input is given to the system various algorithms and method-
ologies like neighbour-based matching [5], template matching [6], statistical
modelling [7] algorithm are applied to categorize the input activity into its
appropriate class.
Hierarchical approaches work on the concept of divide and conquer, in
which any complex problem can be solved by dividing it into several sub-problems.
The sub-activities are used to identify the main complex activity. These
approaches are classified on the basis of the recognition methodologies they
use: statistical approach, syntactic approach and description-based approach.
Statistical approaches construct statistical state-based models like layered Hid-
den Markov Models (HMM) to represent and recognize high-level human
activities [8]. Similarly, syntactic approaches use a grammar syntax such as a
stochastic context-free grammar (SCFG) to model sequential activities [9].
Description-based approaches represent human activities by describing sub-events of the
activities and their temporal, spatial and logical structures [10]. Fig. 1.1 sum-
marizes the hierarchical approach based taxonomy of the approaches used in
human activity recognition.

Figure 1.1: Approaches for Activity Recognition
Ke et al. [11] used segmented spatio-temporal volumes to model human
activities; their system applied hierarchical mean-shift to cluster similarly
coloured voxels and obtained several segmented volumes. The motivation is
to find the actor volume segments automatically and to measure their simi-
larity to the action model. Recognition is done by searching for a subset of
over-segmented spatio-temporal volumes that best matches the space of the
action model. Their system recognized simple actions such as hand waving
and boxing from the KTH action database.
Laptev [12] recognized human actions by extracting sparse spatio-temporal
interest points from videos. They extended the local feature detectors (Harris)
commonly used for object recognition in order to detect interest points in
space-time volume. Motion patterns such as change in direction of object;
splitting and merging of an image structure; and collision/bouncing of object
are detected as a result. In their work, these features were used to distinguish
a walking person from complex backgrounds.
Bobick and Davis [13] constructed a real time action recognition system
using template matching. Instead of maintaining the 3-dimensional space-time
volume of each action, they represented each action with a template composed
of two 2-dimensional images: binary Motion Energy Image and scalar Motion
History Image. The two images are constructed from a sequence of foreground
images, which essentially are weighted 2-D projections of the original 3-D
(XYT) volume. By applying a traditional template matching technique to
a pair of (MEI, MHI), their system was able to recognize simple actions
like sitting, arm waving and crouching. However the MHI method suffers
from a serious drawback that when self-occluding or overwriting actions are
encountered it leads to severe recognition failure [14]. This failure results
because when repetitive action is performed the same pixel location is accessed
multiple times due to which the previously stored information in the pixel
gets overwritten or deleted by the current action. In order to address this
issue we implemented a novel technique for creating motion history images
that are capable of representing self-occluding and overwriting action. Our
methodology overcomes the limitation of representing repetitive activities and
thus outshines the conventional MHI method.
Previous work on action recognition has focussed on adapting hand-designed
local features, such as SIFT [15] or HOG [16], from static images to video do-
main. Le et al. [17] proposed a methodology to learn features
directly from video data using unsupervised feature learning. They used an
extension of the Independent Subspace Analysis algorithm to learn invari-
ant spatio-temporal features from unlabelled video data. By replacing hand-
designed features with learnt features, they achieved classification results
superior to previously published results on the Hollywood2 [18], UCF [19], KTH [20]
and YouTube [21] action recognition datasets. In our project we have worked on
Deep Learning techniques such as stacking and convolution to learn hierarchi-
cal representations. The hierarchical features are extracted using unsupervised
learning technique to classify 12 categories of human activities using Support
Vector Machines.
1.2 Report Organisation
The rest of the report is organized as follows:
• Chapter 2: Gives a detailed description of activity recognition using a
temporal template matching technique, i.e. Motion History Image(MHI)
methodology. Construction of MHI and MEI images, feature extraction
and training of the algorithm on Weizmann dataset are shown for classi-
fication of an input video sequence as a bending or side galloping activity.
• Chapter 3: The drawbacks of MHI method are discussed and a novel
approach to overcome the drawbacks is proposed. The algorithm for the
proposed approach and simulated results are provided.
• Chapter 4: Gives an overview of previous hand-coded approaches used
for activity recognition. Thereafter, activity recognition on the Hollywood2
Action Dataset is discussed using recent approaches to Unsupervised
Feature Learning. Stacked Independent Subspace Analysis (ISA) is trained
for 12 different classes of activities and the framework for classification is
given. The results obtained and a comparison with previous results are
provided.
• Chapter 5: Snapshots of the GUI and the system’s performance results
are provided.
• Chapter 6: Concludes the report and gives the future work.
Chapter 2
Recognizing Human Activities
using Temporal Templates
Motion History Image (MHI) methodology is a representation and recognition
theory that decomposes motion-based recognition into first describing where
there is motion (the spatial pattern) and then describing how the motion
is moving. In the methodology, a binary Motion-Energy Image (MEI) and a
scalar-valued Motion-History Image (MHI) are constructed. The MEI represents
where motion has occurred in an image sequence and the MHI depicts how the
motion has occurred. Taken together, the MEI and MHI can be considered
as a two-component version of a temporal template, a vector-valued image
where each component of each pixel is some function of the motion at that
pixel location. These view-specific templates are matched against stored
models of views of known movements.
The initial step for the construction of MEI and MHI involves background
subtraction and obtaining silhouettes. So the captured image frame from the
video is first converted into a grey-scale image using Eq. 2.1. Here the pixel
value Y' of the grey image is determined from the RGB components of the
original image.

Y' = 0.2126R + 0.7152G + 0.0722B   (2.1)
At each step, the current grey image (src1) is compared with the previous
image (src2). This is done to find the difference of the two images which is
stored in (dst), to identify the changes that help in detecting the motion in
the video. The difference is calculated using Eq. 2.2:
dst(I)_c = |src1(I)_c - src2(I)_c|   (2.2)
Then a binary threshold is applied for the construction of the silhouette,
which is the outline of the image, using Eq. 2.3:

dst(x, y) = \begin{cases} \text{maxvalue} & \text{if } src(x, y) > \text{threshold} \\ 0 & \text{otherwise} \end{cases}   (2.3)
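For illustration, the grey-scale conversion, frame differencing and thresholding of Eqs. 2.1-2.3 map onto a few OpenCV calls. The following is a minimal Python sketch; the function name and the threshold value are illustrative assumptions, not the project's exact settings (note also that OpenCV's BGR-to-grey conversion uses the BT.601 luma weights 0.299/0.587/0.114 rather than the weights of Eq. 2.1):

import cv2

def extract_silhouette(frame, prev_gray, thresh_val=30):
    # Eq. 2.1: convert the captured frame to a grey-scale image
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Eq. 2.2: absolute difference between the current and previous grey images
    diff = cv2.absdiff(gray, prev_gray)
    # Eq. 2.3: binary threshold to obtain the silhouette outline
    _, silhouette = cv2.threshold(diff, thresh_val, 255, cv2.THRESH_BINARY)
    return silhouette, gray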
2.1 Motion-Energy Image (MEI)
Motion-Energy Images (MEI) are binary cumulative motion images that are
computed from the start frame to the last frame of the video sequence. The MEI
represents the region where movement occurred in the video data. Let I(x, y, t)
be an image sequence and let D(x, y, t) be a binary image sequence indicating
regions of motion; then the binary MEI is defined as:

E_\tau(x, y, t) = \bigcup_{i=0}^{\tau-1} D(x, y, t-i)   (2.4)
The duration of τ is critical in defining the temporal extent of a movement.
Figure 2.1 shows the MEI image for a hand waving activity.
Figure 2.1: MEI for Hand Wave
2.2 Motion History Image (MHI)
In motion history images, the intensity of each pixel is varied based on how
recently an activity was performed: the more recent the activity, the higher
the intensity. This variation of intensities helps in finding the direction of
motion and the course of action. For the formation of the MHI image, each
pixel's intensity at location (x, y) at time t is given by:

MHI(x, y, t) = \begin{cases} t_s & \text{if } silhouette(x, y, t) \neq 0 \\ 0 & \text{if } silhouette(x, y, t) = 0 \text{ and } MHI(x, y, t-1) < (t_s - d) \\ MHI(x, y, t-1) & \text{otherwise} \end{cases}   (2.5)
Here in Eq. 2.5:
MHI(x, y, t): value of pixel at location (x, y) at time t
ts: timestamp which denotes the current time
d: duration for which motion is tracked
Every pixel of the image which is traced will satisfy one of the four cases
described below:
Figure 2.2: Different cases of MHI formation for Hand Waving Activity
• Case 1: Point 1 in Fig. 2.2 corresponds to a pixel where motion has
just occurred in the current silhouette frame, i.e. silhouette(x, y, t) ≠ 0.
Therefore the corresponding pixel in the MHI image is set to the value
of the timestamp, i.e. MHI(x, y, t) = timestamp.
• Case 2: Point 2 represents a pixel where motion has not occurred in
the current silhouette frame, i.e. silhouette(x, y, t) = 0, and whose pixel
value at time t−1 is less than (timestamp − duration), i.e. MHI(x, y, t−1) <
(timestamp − duration); therefore the corresponding pixel in the MHI
image is set to 0.
• Case 3: In the third point, there was no motion in the current silhou-
ette frame, however its previous value (MHI(x, y, t−1)) is not less than
(timestamp-duration) and therefore the corresponding pixel’s previous
value is retained in the MHI image. But as the timestamp increases
continuously, the intensity of this pixel is relatively lower than that of point 1.
• Case 4: In point 4, motion never occurred throughout the execution
of the activity, i.e. silhouette(x, y, t) = 0 for all values of the timestamp t_s.
Therefore the corresponding pixel’s value in the MHI image is set to 0.
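The four cases above reduce to two vectorized array updates. The following is a minimal NumPy sketch of Eq. 2.5 (array and parameter names are illustrative; OpenCV's optional motion-templates module in opencv-contrib provides an equivalent updateMotionHistory routine):

import numpy as np

def update_mhi(mhi, silhouette, ts, d):
    # Case 1: motion at this pixel, so stamp it with the current timestamp
    mhi[silhouette != 0] = ts
    # Case 2: no motion and the stored value is older than (ts - d), so reset it
    mhi[(silhouette == 0) & (mhi < ts - d)] = 0
    # Cases 3 and 4: every remaining pixel keeps its previous value
    return mhi

# The binary MEI of Eq. 2.4 is then simply the non-zero support of the MHI:
# mei = (mhi > 0).astype(np.uint8)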
2.3 Feature Extraction
Transforming the input data into the set of features is called feature extraction.
If the features extracted are carefully chosen, it is expected that the feature
set will capture the relevant information from the input data, so that the
desired task can be performed using this reduced representation instead of the
full-size input.
Feature extraction involves simplifying the amount of resources required to
describe a large set of data accurately. When performing analysis of complex
data one of the major problems stems from the number of variables involved.
Analysis with a large number of variables generally requires a large amount of
memory and computation power or a classification algorithm which over-fits
the training sample and generalizes poorly to new samples. Feature extraction
is a general term for methods of constructing combinations of the variables
to get around these problems while still describing the data with sufficient
accuracy.
Statistical descriptions of a set of MEI and MHI images for each
view/movement combination are given to the system using moment-based
features. An image moment is a particular weighted average (moment)
of the image pixels’ intensities, or a function of such moments, usually chosen
to have some attractive property or interpretation. Simple properties of the
image which are found via image moments include area (or total intensity), its
centroid, and information about its orientation. For an image with pixel
intensities I(x, y), the raw image moments M_{ij} are calculated by:

M_{ij} = \sum_x \sum_y x^i y^j I(x, y)   (2.6)
Central moments are defined as:

\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q I(x, y)   (2.7)

where:

\bar{x} = M_{10}/M_{00}   (2.8)

\bar{y} = M_{01}/M_{00}   (2.9)
Moments \eta_{ij}, where i + j \geq 2, can be constructed to be invariant to both
translation and changes in scale by dividing the corresponding central moment
by the properly scaled (0, 0)th moment, using the following formula:

\eta_{ij} = \frac{\mu_{ij}}{\mu_{00}^{1 + (i+j)/2}}   (2.10)
It is possible to calculate moments which are invariant under translation,
changes in scale, and also rotation. Most frequently used are the Hu set of
invariant moments [22]:

I_1 = \eta_{20} + \eta_{02}   (2.11)

I_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2   (2.12)

I_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2   (2.13)

I_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2   (2.14)

I_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2]   (2.15)

I_6 = (\eta_{20} - \eta_{02})[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})   (2.16)

I_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2]   (2.17)
Using these seven Hu moments, seven features are extracted from each of the
MHI and MEI images. The resulting vector of 1 × 14 dimension is given to the
system for training and classification purposes, as shown in Fig. 2.3.
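As a sketch, this 1 × 14 vector can be assembled with OpenCV's built-in moment routines (the function name is an illustrative assumption):

import cv2
import numpy as np

def extract_features(mhi, mei):
    # Seven Hu moments (Eqs. 2.11-2.17) computed from each template image
    hu_mhi = cv2.HuMoments(cv2.moments(mhi)).flatten()
    hu_mei = cv2.HuMoments(cv2.moments(mei)).flatten()
    # Concatenation gives the 1 x 14 feature vector used for training
    return np.concatenate([hu_mhi, hu_mei])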
2.4 Dataset
The system was trained and tested on the Weizmann Dataset [23] and a dataset
made in the Department of Information and Communication Technology of MIT,
Manipal. The Weizmann Dataset contains seven training videos each for the
bending and side galloping activities; each activity also has one test video.
The Manipal (ICT) dataset contains three training videos and one test video for
each activity. In total there are 20 training videos and 4 test videos for the
two classes of actions, namely bending and side galloping. The training data
used is summarized in Table 2.1.
Figure 2.3: Feature Extraction from MHI and MEI
Table 2.1: Action Dataset

Weizmann Dataset       Training Videos    Test Videos
Bending                7                  1
Side Galloping         7                  1

Manipal Dataset        Training Videos    Test Videos
Bending                3                  1
Side Galloping         3                  1

All Samples            20                 4
2.5 Training and Classification
Supervised learning is the machine learning task of inferring a function from
labeled training data. The training data consist of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory sig-
nal). A supervised learning algorithm analyzes the training data and produces
an inferred function, which can be used for mapping new examples. In the
project the training examples that are provided to the system are in the form
of an n × 14 matrix, where n is the number of training videos provided to the
system and 14 is the number of features from the MHI and MEI images
corresponding to each video. Along with this, an n × 1 matrix is provided to
the system that specifies
the class for each training video given to the system. Therefore, corresponding
to each training video, 14 features of the video and the class label of the video
is given to the system.
In pattern recognition, the k-nearest neighbour algorithm (k-NN) is a non-
parametric method for classifying objects based on closest training examples
in the feature space [24]. An object is classified by a majority vote of its
neighbours, with the object being assigned to the class most common amongst
its k nearest neighbours (k is a positive integer, typically small). If k = 1,
then the object is simply assigned to the class of its nearest neighbour.
When an unlabelled input video is given to the system, MHI and MEI
images are constructed corresponding to the video. From these MHI and
MEI images, features are extracted using 7-Hu Moments. The system then,
computes Euclidean Distance between the features of the input image and the
features of the labelled training videos using the equation:

d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2}   (2.20)
Given the Euclidean distances between the unlabelled input video and the
training videos, the training videos are arranged in ascending order of the
calculated distance. A voting algorithm is then applied to the top k training
videos to determine the class of the unlabelled input video sequence. The
system is trained on the Weizmann dataset and tested on real-time video. The
system is successfully able to recognize bending and side-galloping activities.
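A minimal sketch of this k-NN classifier, assuming train_X is the n × 14 feature matrix and train_y the corresponding array of class labels (names are illustrative):

import numpy as np

def knn_classify(train_X, train_y, x, k=3):
    # Euclidean distance (Eq. 2.20) from the test vector to every training vector
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))
    # Arrange the training videos in ascending order of distance and take the top k
    nearest = train_y[np.argsort(dists)[:k]]
    # A majority vote among the k nearest neighbours gives the predicted class
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]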
2.6 Summary
The chapter provided a detailed overview of MHI methodology, feature extrac-
tion, training and classification modules involved in the activity recognition
from video sequences. Training was performed on Weizmann dataset and test-
ing was done for real-time videos. The system was made capable of recognizing
bending and side galloping activities.
Chapter 3
3-Channel Motion History
Images
Motion History Image (MHI) methodology is a very popular method used in
activity recognition. However, it suffers from a very serious drawback: it fails
when repetitive and over-writing activities are encountered. In this chapter
we demonstrate the failure of the MHI method and provide a novel methodology
which overcomes the drawbacks of the conventional MHI method. The algorithm
of our proposed methodology and simulated results are given.
3.1 Failure of MHI in case of Repetitive Ac-
tions
In the case of repetitive actions, two or more intensities must be stored
at a single pixel location. This results in overwriting and deletion of the
previous intensity stored at that pixel. The same is shown in Fig. 3.1, where
a waving action is performed first in the anti-clockwise direction and then in
the clockwise direction. The resulting image does not lead to any conclusion,
as the previous information of the anti-clockwise movement has been erased by
the clockwise movement. This proves that MHI completely fails when overwriting
and self-occluding actions are performed.
This drawback of the MHI method cannot be overlooked, because overwriting
and self-occluding actions are very common and are encountered a number of
times in activity recognition. To address this issue in MHI, we have devised
a methodology which will not only characterize how the motion occurred but
will also be able to represent repetitive activities successfully.

Figure 3.1: MHI Failure: In case of Repetitive Action
3.2 Improved MHI Methodology
In our mechanism the initial steps of background subtraction and formation
of silhouettes remain the same as in the conventional MHI method. However,
instead of using a scalar-valued image as in the case of conventional MHI, we have used red,
green and blue channels of three different 3-channel images. A single frame of
the activity video is taken and three different images namely MHIR, MHIG,
MHIB are computed. Initially, when no activity is being performed, the pixel
intensities of all the three images are null as shown in Fig. 3a-3c. Now to find
the motion history images, each pixel of the silhouette frame is examined with
the corresponding pixel of MHIR, MHIG and MHIB. When motion occurs at a
particular pixel location of the silhouette frame, the corresponding pixel value
of MHIR, MHIG, MHIB are checked. If the value is found to be null for all
the channels, it implies that the motion has occurred for the first time at that
pixel location. In that case, motion is represented using the red channel of
MHIR as in Fig. 3d-3i where MHIR, MHIG, MHIB respectively are shown for
the movement of an arm upwards in the anti-clockwise direction. Here each
pixel's RGB value of MHIR is determined in accordance with the equations:

R(x, y, t) = \begin{cases} t_s & \text{if } silhouette(x, y, t) \neq 0 \\ 0 & \text{if } silhouette(x, y, t) = 0 \text{ and } R(x, y, t-1) < (t_s - d) \\ R(x, y, t-1) & \text{otherwise} \end{cases}   (3.1)

G(x, y, t) = 0   (3.2)

B(x, y, t) = 0   (3.3)
Now consider the case when motion occurs at a particular pixel location in the
silhouette frame, the corresponding pixel's red channel value of MHIR is
greater than a specified threshold, but the blue and green channel values of
MHIB and MHIG respectively are null. This suggests that a repetitive action is
being performed at that pixel location. Therefore, we represent that action
using the green channel of MHIG, and MHIR remains unchanged. In this way, there is
no overwriting of the previous action in the MHIR by the current action. Fig.
3j-3l shows MHI images when an action is performed first in the anticlockwise
direction and then in the clockwise direction. Each pixel's RGB value of MHIG
is varied according to the equations:

R(x, y, t) = 0   (3.4)

G(x, y, t) = \begin{cases} t_s & \text{if } silhouette(x, y, t) \neq 0 \\ 0 & \text{if } silhouette(x, y, t) = 0 \text{ and } G(x, y, t-1) < (t_s - d) \\ G(x, y, t-1) & \text{otherwise} \end{cases}   (3.5)

B(x, y, t) = 0   (3.6)
If a repetitive action is continued a second time at a pixel location in the
silhouette frame, then the corresponding pixel's green channel value of MHIG
will not be null. In such a situation the blue channel of MHIB is utilized to
represent this action, and MHIR & MHIG are unchanged.
Therefore the previous actions in MHIR and MHIG are not over-written by
the current action in MHIB. Fig. 3m-3o shows MHI images when an action
is performed in the anticlockwise direction as depicted in MHIR, then in the
clockwise direction as shown in MHIG and finally again in the anti-clockwise
direction as shown in the MHIB. Each pixel's RGB value of MHIB is determined
using the equations:

R(x, y, t) = 0   (3.7)

G(x, y, t) = 0   (3.8)

B(x, y, t) = \begin{cases} t_s & \text{if } silhouette(x, y, t) \neq 0 \\ 0 & \text{if } silhouette(x, y, t) = 0 \text{ and } B(x, y, t-1) < (t_s - d) \\ B(x, y, t-1) & \text{otherwise} \end{cases}   (3.9)
Therefore our methodology is effective in representing not only simple
non-repetitive actions but also complex self-occluding and overwriting activities.
Fig 3a: MHIR    Fig 3b: MHIG    Fig 3c: MHIB

Figure 3.2: Initially when no activity is performed
Fig 3d Fig 3e Fig 3f
Figure 3.3: Activity starts for the first time
Fig 3g Fig 3h Fig 3i
Figure 3.4: Hands raised in the anti-clockwise direction
Fig 3j Fig 3k Fig 3l
Figure 3.5: Actions performed in the clockwise direction
Fig 3m Fig 3n Fig 3o
Figure 3.6: Hand raised upwards for the second time
3.3 Proposed Algorithm & Results Obtained
Graph 1 (Fig. 3.7) shows the number of active pixels per frame for MHIR, MHIG
and MHIB; here the number of active pixels means the count of pixels whose
value is not null. When the hand moves upwards the number of active pixels per frame
of MHIR image increases continuously and its corresponding graph increases
whereas the graphs for MHIG and MHIB continue to remain null. Then when
the hand stops for some time and there is no motion performed, the graph of
MHIR decreases as the count of active pixels decreases in accordance with the
Eq. 3.1. Thereafter when the hand moves downwards, MHIR image becomes
constant since a repetitive action is encountered and this new action is rep-
resented in MHIG. The graph of MHIG increases till the action is performed
in the downward direction, and then starts decreasing when the hand stops
for some time. When the hand moves again in the upward direction, MHIG
image becomes constant and so does the graph for MHIG and this new action
is represented in MHIB. The graph for MHIB continuously increases when
hand moves upwards and then starts decreasing when the hand stops.
Figure 3.7: Graph 1

In Graph 2 the summation of pixel values for a 20 × 20 box centred at
location (287, 193) is plotted for each frame of MHIR, MHIG and MHIB. Initially,
when the hand is moving in the upward direction, since motion is occurring
for the first time, the action is represented in MHIR whereas MHIG and MHIB
continue to remain null images. The summation of pixel values of the box for
the MHIR image is initially null implying that the motion is occurring outside
the box. Later the MHIR graph continues to increase with the frame number,
implying that the hand is moving inside the box. At frame no. 33 the graph for
MHIR starts decreasing; this happens due to the fading of the pixels of the box
when the hand has moved out of the box. From the 50th frame the hand starts
moving in the downward direction, and thus MHIR image becomes constant
and this new action is represented in the MHIG image. When this repetitive
action is performed, the MHIR graph becomes constant and MHIG remains null
until the downward movement of the hand is performed inside the box. At frame
no. 207, the MHIG graph starts increasing, implying that the
motion is occurring inside the box. After frame no. 216, the MHIG graph starts
decreasing due to the fading of the pixels when the hand has moved out of the
box. Thereafter when an action is again performed in the upward direction,
MHIG becomes constant and the new action is represented in MHIB. Here
also MHIB graph remains null till the action is not performed inside the box.
Thereafter MHIB graph increases for the frames in which action is performed
inside the box and then MHIB graph decreases due to the fading of the pixels
when the hand has moved out of the box.
Figure 3.8: Graph 2
Algorithm 1 describes the proposed methodology using which the simulated
results are obtained.
Algorithm 1 Improved MHI Algorithm
Data: silhouette, MHIR, MHIG, MHIB, timestamp, duration
for x := 0 to image.width do
    for y := 0 to image.height do
        valR := MHIR(x, y); valG := MHIG(x, y); valB := MHIB(x, y)
        if silhouette(x, y) then
            // motion at this pixel: stamp the first unused channel
            if valR = 0 and valG = 0 and valB = 0 then
                valR := timestamp
            else if valR != 0 and valG = 0 and valB = 0 then
                valG := timestamp
            else if valR != 0 and valG != 0 and valB = 0 then
                valB := timestamp
            end if
        else
            // no motion: only the most recently written channel fades
            if valR != 0 and valG = 0 and valB = 0 then
                valR := (valR < timestamp - duration) ? 0 : valR
            else if valR != 0 and valG != 0 and valB = 0 then
                valG := (valG < timestamp - duration) ? 0 : valG
            else if valR != 0 and valG != 0 and valB != 0 then
                valB := (valB < timestamp - duration) ? 0 : valB
            end if
        end if
        MHIR(x, y) := valR; MHIG(x, y) := valG; MHIB(x, y) := valB
    end for
end for
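Algorithm 1 also maps directly onto vectorized array operations; the following is a minimal NumPy sketch of one update step under the same conventions (array names are illustrative):

import numpy as np

def update_3channel_mhi(mhi_r, mhi_g, mhi_b, silhouette, ts, d):
    moving = silhouette != 0
    r0, g0, b0 = mhi_r == 0, mhi_g == 0, mhi_b == 0
    # Motion: stamp the first unused channel (red, then green, then blue)
    mhi_r[moving & r0 & g0 & b0] = ts
    mhi_g[moving & ~r0 & g0 & b0] = ts
    mhi_b[moving & ~r0 & ~g0 & b0] = ts
    # No motion: only the most recently written channel fades, so the
    # earlier channels stay frozen and previous actions are not overwritten
    still = ~moving
    mhi_r[still & ~r0 & g0 & b0 & (mhi_r < ts - d)] = 0
    mhi_g[still & ~r0 & ~g0 & b0 & (mhi_g < ts - d)] = 0
    mhi_b[still & ~r0 & ~g0 & ~b0 & (mhi_b < ts - d)] = 0
    return mhi_r, mhi_g, mhi_b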
3.4 Summary
Motion History Images contain the necessary details about how an action was
performed. The conventional MHI method suffers from several drawbacks; one of
them is failure in the case of over-writing and self-occluding activities. In
this report we have presented and implemented a novel technique for the
formation of motion history images which is successful in representing
repetitive activities. In our methodology an action is first represented using MHIR and when
a repetitive action is performed it is represented using MHIG. If a repetitive
action is further continued it is represented in MHIB. Thus in our approach
when a repetitive activity is performed the information of the previous ac-
tion is retained and is not overwritten. In the report we have provided an
algorithm of our approach and have experimentally proved our methodology
through graphs and simulations for repetitive activities.
Chapter 4
Unsupervised Feature Learning
for Activity Recognition
Previous work done in the field of activity recognition focussed on hand-coded
feature learning techniques. In the subsequent sections an overview of the
previous approaches and their drawbacks is given, and recent advances in
Unsupervised Feature-Learning are discussed. The system is trained on Hol-
lywood2 Action Dataset using stacked Independent Subspace Analysis. The
features thus learnt, are used for classification using Support Vector Machine.
4.1 Related Work
Previous approaches in activity recognition rely on hand-designed features
such as SIFT and HOG. A weakness of such approaches is that it is difficult
and time-consuming to extend these features to other sensor modalities, such
as laser scans, text or even videos. Most of the current approaches make
certain assumptions (e.g., small scale and viewpoint changes) about the
circumstances under which the video was taken. However, such assumptions
seldom hold in the real world environment. In addition, most of the methods
follow a two-step approach in which the first step computes features from
raw video frame and the second step learns classifiers based on the obtained
features. In real world scenarios, it is rarely known what features are important
for the task at hand, since the choice of features is highly problem-dependent
[25].
In recent years, low-level hand designed features have been heavily em-
ployed with much success. Typical examples of such successful features for
static images are SIFT, HOG, GLOH and SURF. Extending the above fea-
tures to 3-D is the pre-dominant methodology in video action recognition.
These methods usually have two stages: an optional feature detection stage
followed by a feature description stage. Well-known feature detection methods
are Harris 3-D, Cuboids and Hessian. For descriptors, popular methods are
Cuboids, HOG 3-D and extended SURF. Wang et al. [26] combined various
low-level feature detection and feature description methods and benchmarked
their performance on the KTH, UCF Sports Action and Hollywood2 datasets. They
used a state-of-the-art processing pipeline with vector quantization, feature
normalization and χ2 kernel SVMs. Their most interesting finding was that
there was no universally best hand-engineered feature for all datasets.
Their findings suggest that learning features directly from the dataset itself
may be more advantageous. In our project we have implemented Wang et al.’s
experimental protocols by using their standard processing pipeline and replac-
ing the first stage of feature extraction with an unsupervised feature learning
technique.
4.2 PCA Whitening
The goal of whitening is to make the input less redundant; more formally, our
desiderata are that our learning algorithm sees a training input where
• Features are less correlated with each other.
• Features all have the same variance.
Principal component analysis (PCA) is a mathematical procedure that uses
an orthogonal transformation to convert a set of observations of possibly corre-
lated variables into a set of values of linearly uncorrelated variables called prin-
cipal components [27]. PCA is a dimensionality reduction algorithm that can
be used to significantly speed up an unsupervised feature learning algorithm.
It finds a lower-dimensional subspace onto which the data is projected. The new
projected data points are given by:

x_{rot} = U^T x   (4.1)

where U is the matrix of principal eigenvectors. To reduce the dimensionality
to k dimensions we choose the first k components of x_{rot}, thus achieving the
first goal of whitening by making the features less correlated.
The second goal of whitening is achieved by rescaling each feature x_{rot,i} by
1/\sqrt{\lambda_i}. Concretely, the whitened data x_{PCAwhite} is given by:

x_{PCAwhite,i} = \frac{x_{rot,i}}{\sqrt{\lambda_i}}   (4.2)

where \lambda_i is the ith eigenvalue.
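A minimal NumPy sketch of both whitening steps (the small constant eps is a commonly used regularizer added here as an assumption, not part of Eqs. 4.1-4.2):

import numpy as np

def pca_whiten(X, k, eps=1e-5):
    # X holds one zero-mean training example per column
    sigma = X @ X.T / X.shape[1]                  # empirical covariance matrix
    U, S, _ = np.linalg.svd(sigma)                # U: eigenvectors, S: eigenvalues
    x_rot = U[:, :k].T @ X                        # Eq. 4.1, first k components
    x_white = x_rot / np.sqrt(S[:k, None] + eps)  # Eq. 4.2, rescale by 1/sqrt(lambda)
    return x_white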
4.3 Independent Subspace Analysis
Figure 4.1: ISA Network
Independent subspace analysis(ISA) [28] is an unsupervised learning algo-
rithm that learns features from unlabelled image patches. An ISA network is a
two-layered network with square and square-root nonlinearities in the first and
the second layers respectively. The weights W in the first layer are learned,
and the weights V of the second layer are fixed. Each of the second layer hid-
den units pools over a small neighbourhood of adjacent first layer units. The
first and second layer units are called simple and pooling units respectively.
Given an input pattern x_t, the activation of each pooling unit is given by:

p_i(x_t; W, V) = \sqrt{ \sum_{k=1}^{m} V_{ik} \left( \sum_{j=1}^{n} W_{kj} x_{t,j} \right)^2 }   (4.3)
The parameters W are learnt by finding sparse feature representations in the
second layer, i.e., by minimizing:

\sum_{t=1}^{T} \sum_{i=1}^{m} p_i(x_t; W, V)   (4.4)

subject to:

W W^T = I   (4.5)

where:
W \in R^{k \times n}
V \in R^{m \times k}
\{x_t\}_{t=1}^{T} : whitened input examples
n : input dimension
k : number of simple units
m : number of pooling units
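In practice W is learned by gradient descent on Eq. 4.4, with the constraint of Eq. 4.5 typically enforced by orthonormalizing W after each update. The forward pass of Eq. 4.3 itself reduces to two matrix operations; a minimal NumPy sketch:

import numpy as np

def isa_activations(X, W, V):
    # X: n x T whitened inputs, W: k x n learned weights,
    # V: m x k fixed pooling matrix over adjacent simple units
    simple = (W @ X) ** 2         # first layer: square nonlinearity
    pooled = np.sqrt(V @ simple)  # second layer: square-root pooling (Eq. 4.3)
    return pooled                 # m x T pooling-unit activations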
The ISA algorithm is able to learn Gabor filters (edge detectors) with many
frequencies and orientations. Further, it is able to group similar features
together, thereby achieving invariance. However, the standard ISA training
algorithm becomes less efficient when input patches are large. Thus, training
this algorithm with high dimensional data, especially video data, takes days to
complete. In order to scale up the algorithm to large inputs, we implemented
a convolutional neural network architecture that progressively makes use of
PCA and ISA as sub-units for unsupervised learning. We first trained the
ISA algorithm on small input patches. We then take this learned network and
convolve it with a larger region of the input image. The combined responses of
the convolution step are then given as input to the next layer, which is also
implemented by another ISA algorithm with PCA as a preprocessing step.
Similar to the first layer, we use PCA to whiten the data and reduce their
dimensions such that the next layer of the ISA algorithm only works with low
dimensional inputs as shown in Fig. 4.2.
Figure 4.2: Stacked ISA
4.4 Stacked ISA for Video Domain
The stacked ISA network can be extended to videos by giving the inputs to the
network as 3-D video blocks instead of image patches. More specifically, a
sequence of image patches is flattened into a vector and given to the network
as shown in Fig. 4.3. Finally features from both layers are used as local
features for classification.
Figure 4.3: Stacked ISA for video data
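A minimal sketch of the block extraction and flattening described above (block sizes and stride are illustrative assumptions, not the project's exact settings):

import numpy as np

def flatten_video_blocks(video, bx=16, by=16, bt=10, stride=8):
    # video: T x H x W grey-scale array
    T, H, W = video.shape
    blocks = []
    for t in range(0, T - bt + 1, stride):
        for y in range(0, H - by + 1, stride):
            for x in range(0, W - bx + 1, stride):
                # Flatten each 3-D sub-block into a vector for the first ISA layer
                blocks.append(video[t:t + bt, y:y + by, x:x + bx].ravel())
    return np.array(blocks)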
4.5 Experimental Setup
In this section we discuss the Hollywood2 dataset used for the evaluation pro-
tocol. The features obtained from the dataset, using unsupervised feature
learning, are utilized for classification of an unlabelled input video. We eval-
uate the features in a bag-of-features based action classification task and use
Support Vector Machine for classification.
4.5.1 Dataset
The system was trained on the Hollywood2 human actions dataset. The dataset
consists of 12 classes of human activities distributed over 2517 video clips
and approximately 15.6 hours of video in total. The dataset is composed of
video clips from 69 movies. The training set has two subsets namely: clean
and automatic. Clean training subset has action labels which are manually
verified to be correct. Automatic training subset has noisy action labels which
are collected by means of automatic script-to-video alignment in combination
with text-based script classification. The dataset has a total of 823 clean
training videos, 810 automatic training videos and 884 test videos.
Human actions in the dataset are: answer phone, drive car, eat, fight person,
get out car, handshake, hug person, kiss, run, sit down, sit up and stand up.
The performance is evaluated as suggested in [18] i.e., by computing average
precision (AP) for each of the action classes and reporting the mean AP over
all classes. The dataset is summarized in Table 4.1.
Table 4.1: Hollywood2 Action Dataset

Action          Training Subset (clean)   Training Subset (automatic)   Test Subset (clean)
Answer Phone    66                        59                            64
Drive Car       85                        90                            102
Eat             40                        44                            33
Fight Person    54                        33                            70
Get Out Car     51                        40                            57
Handshake       32                        38                            45
Hug Person      64                        27                            66
Kiss            114                       125                           103
Run             135                       187                           141
Sit Down        104                       87                            108
Sit Up          24                        26                            37
Stand Up        132                       133                           146
All Samples     823                       810                           884
4.5.2 Framework for Classification
The features from the test videos are computed from the stacked Indepen-
dent Subspace Analysis(ISA) model as explained in section 4.3. The video
sequences are represented as a bag of local spatio-temporal features. In com-
puter vision, the bag-of-words model (BoW model) can be applied to image
classification, by treating image features as words. In document classification,
a bag of words is a sparse vector of occurrence counts of words; that is, a sparse
histogram over the vocabulary. In computer vision, a bag of visual words is
a sparse vector of occurrence counts of a vocabulary of local image features.
The BoW model converts vector-represented patches to “codewords”. A codeword
can be considered as a representative of several similar patches. One simple
method is performing k-means clustering over all the vectors. Codewords are
then defined as the centres of the learned clusters, and the number of
clusters is the codebook size. Thus, each patch in an image is mapped to a
certain codeword through the clustering process, and the image can be
represented by the histogram of the codewords. To increase the precision,
k-means is initialized three times. For classification, we have used a
non-linear Support Vector Machine [29].
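A minimal sketch of the codebook and histogram construction using scikit-learn's k-means (the vocabulary size is an illustrative assumption):

import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(all_train_features, per_video_features, vocab_size=512):
    # Learn the codebook: k-means over all local features, restarted three
    # times (n_init=3), keeping the clustering with the lowest inertia
    kmeans = KMeans(n_clusters=vocab_size, n_init=3).fit(all_train_features)
    hists = []
    for feats in per_video_features:
        # Map each local feature of a video to its nearest codeword
        words = kmeans.predict(feats)
        hists.append(np.bincount(words, minlength=vocab_size))
    return np.array(hists)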
In machine learning, support vector machines (SVMs, also support vector
networks) are supervised learning models with associated learning algorithms
that analyze data and recognize patterns, used for classification and regression
analysis. The basic SVM takes a set of input data and predicts, for each
given input, which of two possible classes forms the output, making it a non-
probabilistic binary linear classifier. Given a set of training examples, each
marked as belonging to one of two categories, an SVM training algorithm
builds a model that assigns new examples into one category or the other. An
SVM model is a representation of the examples as points in space, mapped so
that the examples of the separate categories are divided by a clear gap that
is as wide as possible. New examples are then mapped into that same space
and predicted to belong to a category based on which side of the gap they fall
on. A support vector machine constructs a hyperplane or set of hyperplanes
in a high- or infinite-dimensional space, which can be used for classification,
regression, or other tasks. In addition to performing linear classification, SVMs
can efficiently perform non-linear classification using what is called the kernel
trick, implicitly mapping their inputs into high-dimensional feature spaces. In
the non-linear classification the dot product used for the linear classification
is replaced by a non-linear kernel function. In the project we used the χ2
kernel [30]:

K(H_i, H_j) = \exp\left( -\frac{1}{2A} \sum_{n=1}^{V} \frac{(h_{in} - h_{jn})^2}{h_{in} + h_{jn}} \right)   (4.6)

where H_i = \{h_{in}\} and H_j = \{h_{jn}\} are the frequency histograms of word
occurrences, V is the vocabulary size and A is the mean value of distances
between all training samples. Multi-class SVM aims to assign labels to instances by
using support vector machines, where the labels are drawn from a finite set
of several elements. The dominant approach for doing so is to reduce the sin-
gle multi-class problem into multiple binary classification problems. A binary
classifiers is used which distinguishes between one of the labels and the rest
(one-versus-all). Classification of new instances for the one-versus-all case is
done by a winner-takes-all strategy, in which the classifier with the highest
output function assigns the class.
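A minimal sketch of Eq. 4.6 together with one-versus-all classification over a precomputed kernel in scikit-learn (variable names are illustrative):

import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

def chi2_kernel(H1, H2, A):
    # Eq. 4.6, evaluated between every pair of histograms
    K = np.zeros((len(H1), len(H2)))
    for i, hi in enumerate(H1):
        for j, hj in enumerate(H2):
            denom = np.where(hi + hj == 0, 1.0, hi + hj)  # empty bins contribute 0
            K[i, j] = np.exp(-np.sum((hi - hj) ** 2 / denom) / (2.0 * A))
    return K

# A = mean chi-square distance between all pairs of training histograms
# clf = OneVsRestClassifier(SVC(kernel='precomputed'))
# clf.fit(chi2_kernel(H_train, H_train, A), y_train)
# predictions = clf.predict(chi2_kernel(H_test, H_train, A))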
4.5.3 Results
The system was trained on Hollywood2 dataset using unsupervised feature
learning technique and then classified using Support Vector Machines. The
mean average precision, recall, accuracy and F-Score obtained for the system
using non-overlapping dense sampling are summarized in Table 4.2. If dense
sampling with 50% overlap is performed on a machine with 24 GB of RAM, the
results improve by 8.6%.
Table 4.2: Performance of System
Mean AP Mean Recall Mean Accuracy F-Score
44.776% 79.419% 92.232% 0.573
4.6 Summary
In this chapter, previous approaches used in activity recognition and their
drawbacks were discussed. Recent advances made in the field of neu-
ral networks such as stacked Independent Subspace Analysis for Unsupervised
Feature-learning were presented. Stacked ISA was trained on Hollywood2 Ac-
tion Dataset. The spatio-temporal features obtained were used to recognize 12
categories of human activities. The results obtained showed that Unsupervised
Feature-Learning outperforms many state-of-the-art methods.
Chapter 5
Snapshots
Snapshots of the Graphical User Interface demonstrating recognition of bending
and side galloping activities using the MHI methodology, and a snapshot of the
system's performance results when trained on the Hollywood2 Action Dataset
using the Unsupervised Feature-Learning technique with SVM for classification,
are given below:
Start Screen Available Tools
Figure 5.1: Graphical User Interface
Figure 5.2: Bending Activity
Motion History Image Motion Energy Image
Figure 5.3: MHI and MEI for Bending Activity
Figure 5.4: Side Galloping Activity
Motion History Image Motion Energy Image
Figure 5.5: MHI and MEI for Side Galloping Activity
Figure 5.6: Training Completion
New input Classification Result
Figure 5.7: Test Bending Activity
New input Classification Result
Figure 5.8: Test Side Galloping Activity
Figure 5.9: Actions in Hollywood2 Dataset
Figure 5.10: Compute Features
Figure 5.11: Classification Results
Chapter 6
Conclusion and Future Work
In the report various approaches used for human activity recognition were dis-
cussed and two different approaches for activity recognition were implemented
for recognizing activities performed in a given video sequence. The project was
divided into two phases. The first phase of the project focussed on Temporal
Template Matching techniques for activity recognition. The Motion History
Image (MHI) methodology was examined for non-repetitive activities, and from
the MHI and MEI images generated, features were extracted using the seven Hu
moments. These features were then used to train the system to classify side
galloping and bending activities using the k-NN algorithm. Thereafter the drawbacks of
MHI methodology were presented and it was shown that MHI method fails
to recognize repetitive and over-writing activities. In order to address this
issue a novel methodology was proposed and implemented and was shown to
be successful when over-writing and repetitive actions are encountered. The
algorithm of the proposed methodology and simulated results have also been
provided for repetitive hand waving action.
The second phase of the project focussed on Unsupervised Feature-
Learning Techniques, a recent development in the field of machine
learning. Various previous works using hand-designed features for activity
recognition and their drawbacks were discussed. The unsupervised feature
learning approach using the concepts of stacked Independent Subspace Anal-
ysis (ISA) and PCA Whitening is implemented for spatio-temporal feature ex-
traction from Hollywood2 Action Dataset. The features learnt are used to
classify unlabelled test video data into its class of activity using Support
Vector Machines. The system was trained and tested for 12 different human
activities.
As part of future work we would like to extend the MHI methodology for
unsupervised feature-extraction techniques. In other words, after the
generation of Motion History and Motion Energy Images, instead of using the
seven Hu moments in the feature extraction stage, we will use stacked
Independent Subspace Analysis to learn features, thus replacing the hand-coded
features with a more efficient method. Also, the k-NN algorithm used in the
classification for the MHI method is naive and scales poorly to large training
data. Therefore, we will replace the k-NN algorithm with a more efficient
classification technique such as the Support Vector Machine.
References
[1] L. G. Shapiro and G. C. Stockman, Computer Vision. Prentice Hall,
2001.
[2] Wikipedia. Activity recognition. [Online]. Available: http://en.wikipedia.org/wiki/Activity_recognition
[3] E. Kim, S. Helal, and D. Cook, “Human activity recognition and pattern
discovery,” Pervasive Computing, IEEE, vol. 9, no. 1, pp. 48–53, Jan.
2010.
[4] J. Aggarwal and M. Ryoo, “Human activity analysis: A review,” ACM
Computing Surveys (CSUR), vol. 43, pp. 16:1–16:43, april 2011.
[5] A. Yilmaz and M. Shah, “Recognizing human actions in videos acquired
by uncalibrated moving cameras,” in Proceedings of the Tenth IEEE In-
ternational Conference on Computer Vision (ICCV’05), vol. 1. IEEE
Computer Society, 2005, pp. 150–157.
[6] E. Shechtman and M. Irani, “Space-time behavior based correlation,” in
IEEE Conference on Computer Vision & Pattern Recognition (CVPR),
vol. 1. IEEE Computer Society, June 2005, pp. 405–412.
[7] S. M. Khan and M. Shah, “Detecting group activities using rigidity of for-
mation,” in Proceedings of the 13th annual ACM international conference
on Multimedia, 2005, pp. 403–406.
[8] N. Nguyen, D. Phung, S. Venkatesh, and H. Bui, “Learning and detect-
ing activities from movement trajectories using the hierarchical hidden
markov model,” in IEEE Computer Society Conference on Computer Vi-
sion & Pattern Recognition CVPR, vol. 2, Jan. 2005, pp. 955–960.
[9] S. W. Joo and R. Chellappa, “Attribute grammar-based event recogni-
tion and anomaly detection,” in IEEE Computer Society Conference on
Computer Vision & Pattern Recognition Workshop CVPRW, Jan. 2006,
pp. 107–112.
[10] A. Gupta, P. Srinivasan, J. Shi, and L. Davis, “Understanding videos,
constructing plots learning a visually grounded storyline model from an-
notated videos,” in IEEE Computer Society Conference on Computer Vi-
sion & Pattern Recognition CVPR, Jan. 2009, pp. 2012–2019.
[11] Y. Ke, R. Sukthankar, and M. Hebert, “Spatio-temporal shape and flow
correlation for action recognition,” in IEEE Computer Society Conference
on Computer Vision & Pattern Recognition (CVPR), June 2007, pp. 1–8.
[12] I. Laptev, “On space-time interest points,” International Journal of Com-
puter Vision, vol. 64, pp. 107–123, Sep. 2005.
[13] A. F. Bobick and J. W. Davis, “The recognition of human movement
using temporal templates,” IEEE Transaction on Pattern Analysis and
Machine Intelligence, vol. 23, pp. 257–267, Mar. 2001.
[14] M. A. R. Ahad, J. K. Tan, H. Kim, and S. Ishikawa, “Motion history im-
age: its variants and applications,” Machine Vision Application, vol. 23,
pp. 255–281, Mar. 2012.
[15] D. G. Lowe, “Object recognition from local scale-invariant features,” in
The Proceedings of the Seventh IEEE International Conference on Com-
puter Vision ICCV, June 1999, pp. 1150–1157.
[16] N. Dalal and B. Triggs, “Histograms of oriented gradients for human
detection,” in IEEE Computer Society Conference on Computer Vision
& Pattern Recognition CVPR, July 2005, pp. 886–893.
[17] Q. Le, W. Zou, S. Yeung, and A. Ng, “Learning hierarchical invariant
spatio-temporal features for action recognition with independent subspace
analysis,” in IEEE Computer Society Conference on Computer Vision &
Pattern Recognition CVPR, June 2011, pp. 3361–3368.
[18] M. Marszałek, I. Laptev, and C. Schmid, “Actions in context,” in IEEE
Conference on Computer Vision & Pattern Recognition CVPR, June
2009, pp. 2929–2936.
[19] M. Rodriguez, J. Ahmed, and M. Shah, “Action mach a spatio-temporal
maximum average correlation height filter for action recognition,” in IEEE
Computer Society Conference on Computer Vision & Pattern Recognition
CVPR, May 2008, pp. 1–8.
[20] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: a
local svm approach,” in Proceedings of the 17th International Conference
on Pattern Recognition ICPR, June 2004, pp. 32–36 Vol.3.
[21] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos
in the wild,” in IEEE Computer Society Conference on Computer Vision
& Pattern Recognition CVPR, June 2009, pp. 1996–2003.
[22] M.-K. Hu, “Visual pattern recognition by moment invariants,” IRE
Transactions on Information Theory, vol. 8, pp. 179–187, Mar. 1962.
[23] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions
as space-time shapes,” in The Tenth IEEE International Conference on
Computer Vision (ICCV’05), 2005, pp. 1395–1402.
[24] Wikipedia. K-nearest neighbors algorithm. [Online]. Available: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
[25] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks
for human action recognition,” IEEE Transactions on Pattern Analysis
& Machine Intelligence, vol. 35, pp. 221–231, Jan. 2013.
[26] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, C. Schmid et al., “Evaluation
of local spatio-temporal features for action recognition,” in BMVC 2009-
British Machine Vision Conference, Sep. 2009, pp. 545–554.
[27] Wikipedia. Principal component analysis. [Online]. Available: https://en.wikipedia.org/wiki/Principal_component_analysis
[28] A. Hyvarinen, J. Hurri, and P. O. Hoyer, Natural image statistics.
Springer, 2009.
[29] C.-C. Chang and C.-J. Lin, “Libsvm: a library for support vector
machines,” ACM Transactions on Intelligent Systems and Technology
(TIST), vol. 2, no. 3, pp. 27:1–27:27, May 2011.
[30] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic
human actions from movies,” in IEEE Conference on Computer Vision
& Pattern Recognition CVPR, June 2008, pp. 1–8.
Project Detail
Student Details
Student Name Arpit Jain
Registration Number 090911345
Section/Roll No. A/18
Email Address [email protected]
Phone No.(M) 9901536320
Student Details
Student Name Satakshi Rana
Registration Number 090911281
Section/Roll No. A/14
Email Address [email protected]
Phone No.(M) 9590329141
Project Details
Project title Human Activity Recognition using Temporal Templates
and Unsupervised Feature-Learning Techniques
Project Duration 5 Months
Date of Reporting 07-01-2013
Organization Details
Organization Name Manipal Institute of Technology
Full Postal Address MIT, Manipal
Website Address www.manipal.edu
Internal Guide Details
Faculty Name Dr. Sanjay Singh
Full Contact Address with PIN Code Dept of I&CT, MIT, Manipal-576104
Email Address [email protected]