
Human Activity Recognition using Temporal Templates

and Unsupervised Feature-Learning Techniques

A project report submitted

to

MANIPAL UNIVERSITY

For Partial Fulfillment of the Requirement for the

Award of the Degree

of

Bachelor of Engineering

in

Information Technology

by

Arpit Jain

Reg. No. 090911345

Satakshi Rana

Reg. No. 090911281

Under the guidance of

Dr. Sanjay Singh

Associate Professor

Department of Information and Communication Technology

Manipal Institute of Technology

Manipal, India

May 2013

I dedicate my thesis to my parents for their support and motivation.

Arpit Jain

For my parents and my sister, Shreyaa.

Satakshi Rana

DECLARATION

I hereby declare that this project work entitled Human Activity Recognition using Temporal Templates and Unsupervised Feature-Learning Techniques is original and has been carried out by us in the Department of Information and Communication Technology of Manipal Institute of Technology, Manipal, under the guidance of Dr. Sanjay Singh, Associate Professor, Department of Information and Communication Technology, M. I. T., Manipal. No part of this work has been submitted for the award of a degree or diploma either to this University or to any other University.

Place: Manipal
Date: 24-05-2013

Arpit Jain
Satakshi Rana

CERTIFICATE

This is to certify that this project entitled Human Activity Recognition using Temporal Templates and Unsupervised Feature-Learning Techniques is a bonafide project work done by Mr. Arpit Jain and Ms. Satakshi Rana at Manipal Institute of Technology, Manipal, independently under my guidance and supervision for the award of the Degree of Bachelor of Engineering in Information Technology.

Dr. Sanjay Singh                         Dr. Preetham Kumar
Associate Professor                      Head
Department of I & CT                     Department of I & CT
Manipal Institute of Technology          Manipal Institute of Technology
Manipal, India                           Manipal, India

ACKNOWLEDGEMENTS

We hereby take the privilege to express our gratitude to all the people who were directly or indirectly involved in the execution of this work, without whom this project would not have been a success. We extend our deep gratitude to the lab assistants and Vinay sir for their co-operation and for helping us download the dataset for our project. We thank Dr. Sanjay Singh for his timely suggestions and guidance, and Dr. Preetham Kumar for being a constant source of inspiration. Our hearty thanks to Rahul, our classmates and teachers, who have supported us in all possible ways. Our deepest gratitude to our family and friends, who have been supportive all the while.

ABSTRACT

Computer vision includes methods for acquiring, processing, analyzing, and understanding images. Applications of this field include detecting events, controlling processes, navigation, modelling objects or environments, automatic inspection and many more. Activity recognition is one of the applications of computer vision; it aims to recognize the actions and goals of one or more agents from a series of observations of the agents' actions and the environmental conditions. The goal of this project is to train an algorithm to automate the detection and recognition of human activities performed in video data. The project can be utilized in scenarios such as surveillance systems, intelligent environments, sports play analysis and web-based video retrieval.

In the first phase of the project, actions like bending, side galloping and hand waving were recognized using a temporal template matching technique called the Motion History Image (MHI) methodology. The MHI method is extensively used to represent the history of the temporal changes involved in the execution of an activity. Here the intensity of the pixels of a scalar-valued image is varied depending upon the motion history. MHI suffers from a prominent drawback: when self-occluding or over-writing activities are encountered, it fails to generate a vivid description of the course of the activity. This happens because, when repetitive activities are performed, the current action overwrites or deletes the information of the previous action. In this project we devised and implemented a methodology which overcomes this problem of conventional MHI. In our mechanism, we utilize the red, green and blue channels of three different 3-channel images to represent human movement. In the methodology used, feature extraction from the Motion History Image and Motion Energy Image was performed using the 7 Hu moments, and thereafter training and classification were performed using the k-NN algorithm.

The second phase of the project focuses on a recent development in machine learning, known as "Deep Learning", which builds on the concept of biologically inspired computer vision to achieve the task of unsupervised feature learning. Stacked convolutional Independent Subspace Analysis is used for unsupervised feature extraction on the Hollywood2 dataset. The features are then utilized to classify the videos in the dataset into their appropriate class using a Support Vector Machine (SVM). The system is capable of recognizing 12 different activities performed by humans.

Contents

Acknowledgements
Abstract
List of Tables
List of Figures
Abbreviations
Notations

1 Introduction
  1.1 Literature Survey
  1.2 Report Organisation

2 Recognizing Human Activities using Temporal Templates
  2.1 Motion-Energy Image (MEI)
  2.2 Motion History Image (MHI)
  2.3 Feature Extraction
  2.4 Dataset
  2.5 Training and Classification
  2.6 Summary

3 3-Channel Motion History Images
  3.1 Failure of MHI in case of Repetitive Actions
  3.2 Improved MHI Methodology
  3.3 Proposed Algorithm & Results Obtained
  3.4 Summary

4 Unsupervised Feature Learning for Activity Recognition
  4.1 Related Work
  4.2 PCA Whitening
  4.3 Independent Subspace Analysis
  4.4 Stacked ISA for Video Domain
  4.5 Experimental Setup
    4.5.1 Dataset
    4.5.2 Framework for Classification
    4.5.3 Results
  4.6 Summary

5 Snapshots

6 Conclusion and Future Work

References

Project Detail

List of Tables

2.1 Action Dataset
4.1 Hollywood2 Action Dataset
4.2 Performance of System

List of Figures

1.1 Approaches for Activity Recognition
2.1 MEI for Hand Wave
2.2 Different cases of MHI formation for Hand Waving Activity
2.3 Feature Extraction from MHI and MEI
3.1 MHI Failure: In case of Repetitive Action
3.2 Initially when no activity is performed
3.3 Activity starts for the first time
3.4 Hands raised in the anti-clockwise direction
3.5 Actions performed in the clockwise direction
3.6 Hand raised upwards for the second time
3.7 Graph 1
3.8 Graph 2
4.1 ISA Network
4.2 Stacked ISA
4.3 Stacked ISA for video data
5.1 Graphical User Interface
5.2 Bending Activity
5.3 MHI and MEI for Bending Activity
5.4 Side Galloping Activity
5.5 MHI and MEI for Side Galloping Activity
5.6 Training Completion
5.7 Test Bending Activity
5.8 Test Side Galloping Activity
5.9 Actions in Hollywood2 Dataset
5.10 Compute Features
5.11 Classification Results


ABBREVIATIONS

MHI Motion History Image

MEI Motion Energy Image

ISA Independent Subspace Analysis

SVM Support Vector Machine

CNN Convolutional Neural Network

PCA Principal Component Analysis

NOTATIONS

I(x, y): Pixel value at location (x, y)
I(x, y, t): Pixel value of the t-th frame at location (x, y)
Mij: Moment of order (i + j)
ηij: Central moment of order (i + j)
xi: i-th training sample
λi: i-th eigenvalue
U: Left-singular vectors after singular value decomposition

Chapter 1

Introduction

Computer vision is a field that includes methods for acquiring, processing, analyzing, and understanding images and, in general, high-dimensional data from the real world in order to produce numerical or symbolic information, e.g., in the form of decisions [1]. A theme in the development of this field has been to duplicate the abilities of human vision by electronically perceiving and understanding an image. Applications range from industrial machine vision systems which, say, inspect bottles speeding by on a production line, to research into artificial intelligence and computers or robots that can comprehend the world around them. Computer vision is concerned with the theory behind artificial systems that extract information from images. The image data can take many forms, such as video sequences, views from multiple cameras, or multi-dimensional data from a medical scanner. The classical problem in computer vision is that of determining whether or not the image data contains some specific object, feature, or activity.

Activity recognition is one of the sub-fields of computer vision that aims to recognize the actions and goals of one or more agents from a series of observations of the agents' actions and the environmental conditions [2]. Human activity recognition focuses on accurate detection of human activities based on a pre-defined activity model. It can be exploited to great societal benefit, especially in real-life, human-centric applications such as eldercare and healthcare [3].

1.1 Literature Survey

The various approaches for recognizing human activities are classified into single-layered approaches and hierarchical approaches [4]. Single-layered approaches are further classified into two types depending on how they model human activities: space-time approaches and sequential approaches. Space-time approaches view the input as a 3-D (XYT) volume, while sequential approaches view and interpret the input video as a sequence of observations. Space-time approaches are further divided into three categories based on the features they use from the 3-D space-time volumes: space-time volumes, trajectories and local interest point detectors. Sequential approaches are classified depending on whether they use exemplar-based or model-based recognition methodologies. In single-layered approaches, each activity corresponds to a cluster containing image sequences for that activity. These clusters are categorized into classes, each having some definite property, so when an input is given to the system, algorithms and methodologies such as neighbour-based matching [5], template matching [6] and statistical modelling [7] are applied to categorize the input activity into its appropriate class.

Hierarchical approaches work on the concept of divide and conquer, in which a complex problem is solved by dividing it into several sub-problems. The sub-activities are used to identify the main complex activity. These approaches are classified on the basis of the recognition methodologies they use: statistical, syntactic and description-based. Statistical approaches construct statistical state-based models such as the layered Hidden Markov Model (HMM) to represent and recognize high-level human activities [8]. Similarly, syntactic approaches use a grammar syntax such as a stochastic context-free grammar (SCFG) to model sequential activities [9]. Description-based approaches represent human activities by describing sub-events of the activities and their temporal, spatial and logical structures [10]. Fig. 1.1 summarizes the taxonomy of the approaches used in human activity recognition.

Figure 1.1: Approaches for Activity Recognition

Ke et al. [11] used segmented spatio-temporal volumes to model human activities. Their system applies hierarchical mean-shift to cluster similarly coloured voxels and obtain several segmented volumes. The motivation is to find the actor volume segments automatically and to measure their similarity to the action model. Recognition is done by searching for a subset of over-segmented spatio-temporal volumes that best matches the space of the action model. Their system recognized simple actions such as hand waving and boxing from the KTH action database.

Laptev [12] recognized human actions by extracting sparse spatio-temporal interest points from videos. They extended the local feature detectors (Harris) commonly used for object recognition in order to detect interest points in the space-time volume. Motion patterns such as a change in the direction of an object, splitting and merging of an image structure, and collision or bouncing of objects are detected as a result. In their work, these features were used to distinguish a walking person from complex backgrounds.

Bobick and Davis [13] constructed a real-time action recognition system using template matching. Instead of maintaining the 3-dimensional space-time volume of each action, they represented each action with a template composed of two 2-dimensional images: a binary Motion Energy Image and a scalar Motion History Image. The two images are constructed from a sequence of foreground images, which essentially are weighted 2-D projections of the original 3-D (XYT) volume. By applying a traditional template matching technique to a pair of (MEI, MHI), their system was able to recognise simple actions like sitting, arm waving and crouching. However, the MHI method suffers from a serious drawback: when self-occluding or overwriting actions are encountered, it leads to severe recognition failure [14]. This failure arises because, when a repetitive action is performed, the same pixel location is accessed multiple times, and the information previously stored in the pixel gets overwritten or deleted by the current action. In order to address this issue we implemented a novel technique for creating motion history images that are capable of representing self-occluding and overwriting actions. Our methodology overcomes the limitation of representing repetitive activities and thus improves on the conventional MHI method.

Previous work on action recognition has focussed on adapting hand-designed local features, such as SIFT [15] or HOG [16], from static images to the video domain. Andrew Y. Ng et al. [17] proposed a methodology to learn features directly from video data using unsupervised feature learning. They used an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabelled video data. By replacing hand-designed features with learnt features, they achieved classification results superior to previously published results on the Hollywood2 [18], UCF [19], KTH [20] and YouTube [21] action recognition datasets. In our project we have worked on deep learning techniques such as stacking and convolution to learn hierarchical representations. The hierarchical features are extracted using an unsupervised learning technique to classify 12 categories of human activities using Support Vector Machines.

1.2 Report Organisation

The rest of the report is organized as follows:

• Chapter 2: Gives a detailed description of activity recognition using a temporal template matching technique, i.e. the Motion History Image (MHI) methodology. Construction of MHI and MEI images, feature extraction and training of the algorithm on the Weizmann dataset are shown for classification of an input video sequence as a bending or side galloping activity.

• Chapter 3: The drawbacks of the MHI method are discussed and a novel approach to overcome them is proposed. The algorithm for the proposed approach and simulated results are provided.

• Chapter 4: Gives an overview of previous hand-coded approaches used for activity recognition. Thereafter, activity recognition on the Hollywood2 Action Dataset is discussed using recent approaches to Unsupervised Feature Learning. Stacked Independent Subspace Analysis (ISA) is trained for 12 different classes of activities and a framework for classification is provided. Results obtained and a comparison with previous results are provided.

• Chapter 5: Snapshots of the GUI and the system's performance results are provided.

• Chapter 6: Concludes the report and gives the future work.

Chapter 2

Recognizing Human Activities using Temporal Templates

The Motion History Image (MHI) methodology is a representation and recognition theory that decomposes motion-based recognition into first describing where there is motion (the spatial pattern) and then describing how the motion is moving. In this methodology a binary Motion-Energy Image (MEI) and a scalar-valued Motion-History Image (MHI) are constructed. The MEI represents where motion has occurred in an image sequence, and the MHI depicts how the motion has occurred. Taken together, the MEI and MHI can be considered as a two-component version of a temporal template, a vector-valued image where each component of each pixel is some function of the motion at that pixel location. These view-specific templates are matched against stored models of views of known movements.

The initial step for the construction of the MEI and MHI involves background subtraction and obtaining silhouettes. The captured image frame from the video is first converted into a grey-scale image using Eq. 2.1, where the pixel value Y' of the grey image is determined from the RGB components of the original image:

Y' = 0.2126 R + 0.7152 G + 0.0722 B   (2.1)

At each step, the current grey image (src1) is compared with the previous image (src2). The difference of the two images is stored in dst, to identify the changes that help in detecting motion in the video. The difference is calculated using Eq. 2.2:

dst(I) = |src1(I) − src2(I)|   (2.2)

Then a binary threshold is applied to construct the silhouette, which is the outline of the moving subject, using Eq. 2.3:

dst(x, y) = maxvalue   if src(x, y) > threshold
          = 0          otherwise   (2.3)
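For illustration, a minimal sketch of this silhouette-extraction step is given below, assuming OpenCV is used on BGR video frames. The function name silhouette and the threshold value are illustrative assumptions, not taken from the report; note also that cv2.cvtColor uses standard luma weights that differ slightly from the constants in Eq. 2.1.

    import cv2

    THRESHOLD = 30  # assumed value; the report does not state the threshold used

    def silhouette(prev_frame, curr_frame):
        # Eq. 2.1: convert both frames to grey scale
        prev_grey = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        curr_grey = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
        # Eq. 2.2: absolute difference between consecutive frames
        diff = cv2.absdiff(curr_grey, prev_grey)
        # Eq. 2.3: binary threshold yields the silhouette of the moving region
        _, sil = cv2.threshold(diff, THRESHOLD, 255, cv2.THRESH_BINARY)
        return sil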

2.1 Motion-Energy Image (MEI)

Motion-Energy Images (MEI) are binary cumulative motion images that are computed from the start frame to the last frame of the video sequence. The MEI represents the region where movement occurred in the video data. Let I(x, y, t) be an image sequence and let D(x, y, t) be a binary image sequence indicating regions of motion; then the binary MEI is defined as:

Eτ(x, y, t) = ⋃ from i=0 to τ−1 of D(x, y, t − i)   (2.4)

The duration τ is critical in defining the temporal extent of a movement. Figure 2.1 shows the MEI image for a hand waving activity.

Figure 2.1: MEI for Hand Wave
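As an illustration of Eq. 2.4, the following sketch accumulates the union of recent silhouettes; the function name and the assumption that silhouettes is a list of binary frames produced by the differencing step above are illustrative, not part of the report.

    import numpy as np

    def motion_energy_image(silhouettes, tau):
        # union (Eq. 2.4) of the last tau binary silhouettes D(x, y, t - i)
        mei = np.zeros_like(silhouettes[-1])
        for d in silhouettes[-tau:]:
            mei = np.maximum(mei, d)
        return mei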

2.2 Motion History Image (MHI)

In a motion history image, the intensity of each pixel varies based on how recently an activity was performed at that location: the more recent the activity, the higher the intensity. This variation of intensities helps in finding the direction of motion and the course of the action. For the formation of the MHI image, each pixel's intensity at location (x, y) at time t is given by:

MHI(x, y, t) = ts                  if silhouette(x, y, t) ≠ 0
             = 0                   if silhouette(x, y, t) = 0 and MHI(x, y, t − 1) < (ts − d)
             = MHI(x, y, t − 1)    otherwise   (2.5)

Here, in Eq. 2.5:
MHI(x, y, t): value of the pixel at location (x, y) at time t
ts: timestamp which denotes the current time
d: duration for which motion is tracked

Every pixel of the image that is traced will satisfy one of the four cases described below (see Fig. 2.2).

Figure 2.2: Different cases of MHI formation for Hand Waving Activity

• Case 1: Point 1 in Fig. 2.2 corresponds to a pixel where motion has just occurred in the current silhouette frame, i.e. silhouette(x, y, t) ≠ 0. Therefore the corresponding pixel in the MHI image is set to the value of the timestamp, i.e. MHI(x, y, t) = timestamp.

• Case 2: Point 2 represents a pixel where motion has not occurred in the current silhouette frame, i.e. silhouette(x, y, t) = 0, and whose pixel value at time t − 1 is less than (timestamp − duration), i.e. MHI(x, y, t − 1) < (timestamp − duration). Therefore the corresponding pixel in the MHI image is set to 0.

• Case 3: At the third point, there was no motion in the current silhouette frame; however, its previous value MHI(x, y, t − 1) is not less than (timestamp − duration), and therefore the pixel's previous value is retained in the MHI image. But as the timestamp increases continuously, the intensity of this pixel is relatively lower than that of point 1.

• Case 4: At point 4, motion never occurred throughout the execution of the activity, i.e. silhouette(x, y, t) = 0 for all values of the timestamp ts. Therefore the corresponding pixel's value in the MHI image is set to 0.

2.3 Feature Extraction

Transforming the input data into a set of features is called feature extraction. If the features are carefully chosen, it is expected that the feature set will capture the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input. Feature extraction involves simplifying the amount of resources required to describe a large set of data accurately. When performing analysis of complex data, one of the major problems stems from the number of variables involved. Analysis with a large number of variables generally requires a large amount of memory and computation power, or a classification algorithm which over-fits the training sample and generalizes poorly to new samples. Feature extraction is a general term for methods of constructing combinations of the variables to get around these problems while still describing the data with sufficient accuracy.

Statistical descriptions of a set of MEI and MHI images for each view/movement combination are given to the system using moment-based features. An image moment is a particular weighted average (moment) of the image pixels' intensities, or a function of such moments, usually chosen to have some attractive property or interpretation. Simple properties of the image which are found via image moments include its area (or total intensity), its centroid, and information about its orientation. For an image with pixel intensities I(x, y), the raw image moments Mij are calculated by:

Mij = Σx Σy x^i y^j I(x, y)   (2.6)

Central moments are defined as:

µpq = Σx Σy (x − x̄)^p (y − ȳ)^q f(x, y)   (2.7)

where:

x̄ = M10 / M00   (2.8)
ȳ = M01 / M00   (2.9)

Moments ηij, where i + j ≥ 2, can be constructed to be invariant to both translation and changes in scale by dividing the corresponding central moment by the properly scaled zeroth moment, using the following formula:

ηij = µij / µ00^(1 + (i + j)/2)   (2.10)

It is possible to calculate moments which are invariant under translation, changes in scale, and also rotation. Most frequently used are the Hu set of invariant moments [22]:

I1 = η20 + η02   (2.11)
I2 = (η20 − η02)^2 + 4η11^2   (2.12)
I3 = (η30 − 3η12)^2 + (3η21 − η03)^2   (2.13)
I4 = (η30 + η12)^2 + (η21 + η03)^2   (2.14)
I5 = (η30 − 3η12)(η30 + η12)[(η30 + η12)^2 − 3(η21 + η03)^2] + (3η21 − η03)(η21 + η03)[3(η30 + η12)^2 − (η21 + η03)^2]   (2.15)
I6 = (η20 − η02)[(η30 + η12)^2 − (η21 + η03)^2] + 4η11(η30 + η12)(η21 + η03)   (2.16)
I7 = (3η21 − η03)(η30 + η12)[(η30 + η12)^2 − 3(η21 + η03)^2] − (η30 − 3η12)(η21 + η03)[3(η30 + η12)^2 − (η21 + η03)^2]   (2.17)

Using these 7 Hu moments, seven features are extracted from each of the MHI and MEI images. The resulting vector of dimension 1 × 14 is given to the system for training and classification, as shown in Fig. 2.3.
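For illustration, the 14-dimensional feature vector described above could be computed with OpenCV's moment routines as sketched below; the function name and the assumption that mhi and mei are single-channel images are illustrative.

    import cv2
    import numpy as np

    def mhi_mei_features(mhi, mei):
        hu_mhi = cv2.HuMoments(cv2.moments(mhi)).flatten()    # 7 Hu moments of the MHI
        hu_mei = cv2.HuMoments(cv2.moments(mei)).flatten()    # 7 Hu moments of the MEI
        return np.concatenate([hu_mhi, hu_mei])               # 1 x 14 feature vector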

2.4 Dataset

The system was trained and tested on the Weizmann Dataset [23] and a dataset made in the Department of Information and Communication Technology of MIT, Manipal. The Weizmann Dataset contains seven training videos each for the bending and side galloping activities, and each activity has one test video. The ICT dataset contains three training videos and one test video for each activity. In total there are 20 training videos and 4 test videos for the two classes of actions, namely bending and side galloping. The training data used is summarized in Table 2.1.

Figure 2.3: Feature Extraction from MHI and MEI

Table 2.1: Action Dataset

Weizmann Dataset
                  Training Videos   Test Videos
Bending                 7                1
Side Galloping          7                1

Manipal Dataset
                  Training Videos   Test Videos
Bending                 3                1
Side Galloping          3                1

All Samples            20                4

2.5 Training and Classification

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. In the project, the training examples provided to the system are in the form of an n × 14 matrix, where n is the number of training videos and 14 is the number of features extracted from the MHI and MEI images of each video. Along with this, an n × 1 matrix is provided that specifies the class of each training video. Therefore, corresponding to each training video, the 14 features of the video and its class label are given to the system.

In pattern recognition, the k-nearest neighbour algorithm (k-NN) is a non-parametric method for classifying objects based on the closest training examples in the feature space [24]. An object is classified by a majority vote of its neighbours, with the object being assigned to the class most common amongst its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its nearest neighbour.

When an unlabelled input video is given to the system, MHI and MEI images are constructed for the video, and features are extracted from them using the 7 Hu moments. The system then computes the Euclidean distance between the features of the input video and the features of the labelled training videos:

d(p, q) = d(q, p) = √((q1 − p1)^2 + (q2 − p2)^2 + · · · + (qn − pn)^2)   (2.20)

Given the Euclidean distances between the unlabelled input video and the training videos, the training videos are arranged in ascending order of distance. A voting algorithm is then applied to the top k training videos to determine the class of the unlabelled input video sequence. The system is trained on the Weizmann dataset and tested on real-time video, and is able to recognize the bending and side-galloping activities successfully.
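A minimal sketch of this k-NN step is given below, assuming train_X is the n × 14 feature matrix, train_y the corresponding label vector and query the 1 × 14 feature vector of the unlabelled video; the function name and the default k are illustrative.

    import numpy as np

    def knn_classify(train_X, train_y, query, k=3):
        dists = np.linalg.norm(train_X - query, axis=1)      # Eq. 2.20 for every training video
        nearest = np.argsort(dists)[:k]                      # top-k closest training videos
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        return labels[np.argmax(counts)]                     # majority vote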

2.6 Summary

This chapter provided a detailed overview of the MHI methodology and of the feature extraction, training and classification modules involved in activity recognition from video sequences. Training was performed on the Weizmann dataset and testing was done on real-time videos. The system was made capable of recognizing bending and side galloping activities.

Chapter 3

3-Channel Motion History Images

The Motion History Image (MHI) methodology is a very popular method in activity recognition. However, it suffers from a serious drawback: it fails when repetitive and over-writing activities are encountered. In this chapter we demonstrate the failure of the MHI method and present a novel methodology which overcomes this drawback of the conventional MHI method. The algorithm of our proposed methodology and simulated results are given.

3.1 Failure of MHI in case of Repetitive Actions

In the case of repetitive actions, two or more intensities would have to be stored at a single pixel location. This results in overwriting and deletion of the previous intensity stored at that pixel. This is shown in Fig. 3.1, where a waving action is performed first in the anti-clockwise direction and then in the clockwise direction. The resulting image does not lead to any conclusion, as the previous information of the anti-clockwise movement has been erased by the clockwise movement. This shows that MHI completely fails when overwriting and self-occluding actions are performed.

Figure 3.1: MHI Failure: In case of Repetitive Action

This drawback of the MHI method cannot be overlooked, because overwriting and self-occluding actions are very common and are encountered a number of times in activity recognition. To address this issue, we have devised a methodology which not only characterizes how the motion occurred but is also able to represent repetitive activities successfully.

3.2 Improved MHI Methodology

In our mechanism, the initial steps of background subtraction and formation of silhouettes remain the same as in the normal MHI method. However, instead of using a single scalar-valued image as in normal MHI, we use the red, green and blue channels of three different 3-channel images. A single frame of the activity video is taken and three different images, namely MHIR, MHIG and MHIB, are computed. Initially, when no activity is being performed, the pixel intensities of all three images are null, as shown in Fig. 3a-3c. To form the motion history images, each pixel of the silhouette frame is examined together with the corresponding pixels of MHIR, MHIG and MHIB. When motion occurs at a particular pixel location of the silhouette frame, the corresponding pixel values of MHIR, MHIG and MHIB are checked. If the value is null for all three channels, it implies that motion has occurred for the first time at that pixel location. In that case, the motion is represented using the red channel of MHIR, as in Fig. 3d-3i, where MHIR, MHIG and MHIB respectively are shown for the movement of an arm upwards in the anti-clockwise direction. Here each pixel's RGB value in MHIR is determined in accordance with the equations:

R(x, y, t) = ts               if silhouette(x, y, t) ≠ 0
           = 0                if silhouette(x, y, t) = 0 and R(x, y, t − 1) < (ts − d)
           = R(x, y, t − 1)   otherwise   (3.1)

G(x, y, t) = 0   (3.2)

B(x, y, t) = 0   (3.3)

Now consider the case when motion occurs at a particular pixel location in the silhouette frame and the corresponding pixel's red channel value in MHIR is greater than a specified threshold, while the blue and green channel values of MHIB and MHIG are null. This suggests that a repetitive action is being performed at that pixel location. Therefore, we represent that action using the green channel of MHIG, and MHIR remains unchanged. In this way, the previous action stored in MHIR is not overwritten by the current action. Fig. 3j-3l shows the MHI images when an action is performed first in the anti-clockwise direction and then in the clockwise direction. Each pixel's RGB value in MHIG is varied according to the equations:

R(x, y, t) = 0   (3.4)

G(x, y, t) = ts               if silhouette(x, y, t) ≠ 0
           = 0                if silhouette(x, y, t) = 0 and G(x, y, t − 1) < (ts − d)
           = G(x, y, t − 1)   otherwise   (3.5)

B(x, y, t) = 0   (3.6)

If a repetitive action is continued for a second time at a pixel location in the silhouette frame, then the corresponding pixel's green channel value in MHIG will not be null. In such a situation, the blue channel of MHIB is used to represent this action, and MHIR and MHIG are unchanged. Therefore the previous actions in MHIR and MHIG are not overwritten by the current action in MHIB. Fig. 3m-3o shows the MHI images when an action is performed in the anti-clockwise direction as depicted in MHIR, then in the clockwise direction as shown in MHIG, and finally again in the anti-clockwise direction as shown in MHIB. Each pixel's RGB value in MHIB is determined using the equations:

R(x, y, t) = 0   (3.7)

G(x, y, t) = 0   (3.8)

B(x, y, t) = ts               if silhouette(x, y, t) ≠ 0
           = 0                if silhouette(x, y, t) = 0 and B(x, y, t − 1) < (ts − d)
           = B(x, y, t − 1)   otherwise   (3.9)

Therefore our methodology is effective in representing not only simple non-repetitive actions but also complex self-occluding and overwriting activities.

Figure 3.2: Initially when no activity is performed (panels Fig 3a-3c: MHIR, MHIG, MHIB)

Figure 3.3: Activity starts for the first time (Fig 3d-3f)

Figure 3.4: Hands raised in the anti-clockwise direction (Fig 3g-3i)

Figure 3.5: Actions performed in the clockwise direction (Fig 3j-3l)

Figure 3.6: Hand raised upwards for the second time (Fig 3m-3o)

3.3 Proposed Algorithm & Results Obtained

Graph 1 (Fig. 3.7) shows the number of active pixels per frame for MHIR, MHIG and MHIB, where the number of active pixels is the count of pixels whose value is not null. When the hand moves upwards, the number of active pixels per frame of the MHIR image increases continuously and its graph rises, whereas the graphs for MHIG and MHIB remain null. When the hand then stops for some time and no motion is performed, the graph of MHIR decreases, as the count of active pixels decreases in accordance with Eq. 3.1. Thereafter, when the hand moves downwards, the MHIR image becomes constant, since a repetitive action is encountered, and this new action is represented in MHIG. The graph of MHIG increases while the action is performed in the downward direction, and then starts decreasing when the hand stops for some time. When the hand moves again in the upward direction, the MHIG image becomes constant, and so does its graph, and this new action is represented in MHIB. The graph for MHIB increases continuously while the hand moves upwards and then starts decreasing when the hand stops.

Figure 3.7: Graph 1

In Graph 2 (Fig. 3.8) the summation of pixel values in a 20 × 20 box centred at location (287, 193) is plotted for each frame of MHIR, MHIG and MHIB. Initially, when the hand is moving in the upward direction, since motion is occurring for the first time, the action is represented in MHIR, whereas MHIG and MHIB remain null images. The summation of pixel values of the box for the MHIR image is initially null, implying that the motion is occurring outside the box. Later the MHIR graph increases with the frame number, implying that the hand is moving inside the box. At frame no. 33 the graph for MHIR starts decreasing; this happens due to the fading of the pixels of the box once the hand has moved out of the box. From the 50th frame the hand starts moving in the downward direction, so the MHIR image becomes constant and this new action is represented in the MHIG image. While this repetitive action is performed, the MHIR graph stays constant, and MHIG remains null until the downward movement of the hand enters the box. At frame no. 207 the MHIG graph starts increasing, implying that the motion is occurring inside the box. After frame no. 216 the MHIG graph starts decreasing due to the fading of the pixels once the hand has moved out of the box. Thereafter, when an action is again performed in the upward direction, MHIG becomes constant and the new action is represented in MHIB. Here too, the MHIB graph remains null until the action is performed inside the box; it then increases for the frames in which the action is performed inside the box, and finally decreases due to the fading of the pixels once the hand has moved out of the box.

Figure 3.8: Graph 2

Algorithm 1 describes the proposed methodology with which the simulated results were obtained.

Algorithm 1 Improved MHI Algorithm

Data: silhouette, MHIR, MHIG, MHIB, timestamp, duration
for x := 0 to image.width do
    for y := 0 to image.height do
        valR := MHIR(x, y); valG := MHIG(x, y); valB := MHIB(x, y)
        if silhouette(x, y) then
            if valR = 0 and valG = 0 and valB = 0 then
                valR := timestamp; goto setvalue
            end
            if valR != 0 and valG = 0 and valB = 0 then
                valG := timestamp; goto setvalue
            end
            if valR != 0 and valG != 0 and valB = 0 then
                valB := timestamp; goto setvalue
            end
        else
            if valR != 0 and valG = 0 and valB = 0 then
                valR := (valR < (timestamp - duration)) ? 0 : valR; goto setvalue
            end
            if valR != 0 and valG != 0 and valB = 0 then
                valG := (valG < (timestamp - duration)) ? 0 : valG; goto setvalue
            end
            if valR != 0 and valG != 0 and valB != 0 then
                valB := (valB < (timestamp - duration)) ? 0 : valB; goto setvalue
            end
        end
        setvalue:
        MHIR(x, y) := valR; MHIG(x, y) := valG; MHIB(x, y) := valB
    end
end
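For illustration, a vectorized sketch of Algorithm 1 is given below; the function and variable names are illustrative, and the channel masks are computed from the state of the images before the update, which mirrors the per-pixel branching of the algorithm.

    import numpy as np

    def update_3channel_mhi(mhi_r, mhi_g, mhi_b, sil, timestamp, duration):
        moving    = sil != 0
        untouched = (mhi_r == 0) & (mhi_g == 0) & (mhi_b == 0)
        only_r    = (mhi_r != 0) & (mhi_g == 0) & (mhi_b == 0)
        r_and_g   = (mhi_r != 0) & (mhi_g != 0) & (mhi_b == 0)
        all_set   = (mhi_r != 0) & (mhi_g != 0) & (mhi_b != 0)

        # motion: route the timestamp to the first unused channel (R, then G, then B)
        mhi_r[moving & untouched] = timestamp
        mhi_g[moving & only_r]    = timestamp
        mhi_b[moving & r_and_g]   = timestamp

        # no motion: fade only the most recently used channel once it goes stale
        stale = lambda m: (~moving) & (m < (timestamp - duration))
        mhi_r[only_r  & stale(mhi_r)] = 0
        mhi_g[r_and_g & stale(mhi_g)] = 0
        mhi_b[all_set & stale(mhi_b)] = 0
        return mhi_r, mhi_g, mhi_b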

3.4 Summary

A motion history image contains the necessary details about how an action was performed. The conventional MHI method suffers from several drawbacks, one of which is failure in the case of over-writing and self-occluding activities. In this chapter we have presented and implemented a novel technique for the formation of motion history images which succeed in representing repetitive activities. In our methodology, an action is first represented using MHIR; when a repetitive action is performed it is represented using MHIG, and if the repetitive action is continued further it is represented in MHIB. Thus, in our approach, when a repetitive activity is performed, the information of the previous action is retained and is not overwritten. We have provided an algorithm for our approach and have experimentally demonstrated the methodology through graphs and simulations for repetitive activities.

Chapter 4

Unsupervised Feature Learning for Activity Recognition

Previous work in the field of activity recognition focussed on hand-coded features. In the subsequent sections, an overview of the previous approaches and their drawbacks is given, and recent advances in Unsupervised Feature-Learning are discussed. The system is trained on the Hollywood2 Action Dataset using stacked Independent Subspace Analysis. The features thus learnt are used for classification using a Support Vector Machine.

4.1 Related Work

Previous approaches in activity recognition rely on hand-designed features such as SIFT and HOG. A weakness of such approaches is that it is difficult and time-consuming to extend these features to other sensor modalities, such as laser scans, text or even videos. Most current approaches make certain assumptions (for example, small scale and viewpoint changes) about the circumstances under which the video was taken. However, such assumptions seldom hold in real-world environments. In addition, most of the methods follow a two-step approach in which the first step computes features from raw video frames and the second step learns classifiers based on the obtained features. In real-world scenarios it is rarely known which features are important for the task at hand, since the choice of features is highly problem-dependent [25].

In recent years, low-level hand-designed features have been heavily employed with much success. Typical examples of such successful features for static images are SIFT, HOG, GLOH and SURF. Extending the above features to 3-D is the predominant methodology in video action recognition. These methods usually have two stages: an optional feature detection stage followed by a feature description stage. Well known feature detection methods are Harris 3-D, Cuboids and Hessian; popular descriptors are Cuboids, HOG 3-D and extended SURF. Wang et al. [26] combined various low-level feature detection and feature description methods and benchmarked their performance on the KTH, UCF Sports Action and Hollywood2 datasets. They used a state-of-the-art processing pipeline with vector quantization, feature normalization and χ2-kernel SVMs. Their most interesting finding was that there is no universally best hand-engineered feature across all datasets.

Their findings suggest that learning features directly from the dataset itself may be more advantageous. In our project we follow Wang et al.'s experimental protocol by using their standard processing pipeline and replacing the first stage of feature extraction with an unsupervised feature learning technique.

4.2 PCA Whitening

The goal of whitening is to make the input less redundant; more formally, we want the learning algorithm to see a training input in which:

• the features are less correlated with each other, and

• the features all have the same variance.

Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components [27]. PCA is a dimensionality reduction algorithm that can be used to significantly speed up an unsupervised feature learning algorithm. It finds a lower-dimensional subspace onto which the data is projected. The projected data points are obtained as:

x_rot = U^T x   (4.1)

where U is the matrix of principal eigenvectors. To reduce the dimensionality to k dimensions we keep the first k components of x_rot, thus achieving the first goal of whitening by making the features less correlated.

The second goal of whitening is achieved by rescaling each feature x_rot,i by 1/√λi. Concretely, the whitened data x_PCAwhite is given by:

x_PCAwhite,i = x_rot,i / √λi   (4.2)

where λi is the i-th eigenvalue.
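A minimal sketch of PCA whitening (Eqs. 4.1 and 4.2) is given below, assuming X holds one zero-mean training example per column; the function name, k and the small regularizer eps are illustrative.

    import numpy as np

    def pca_whiten(X, k, eps=1e-5):
        sigma = X @ X.T / X.shape[1]                 # covariance matrix of the inputs
        U, S, _ = np.linalg.svd(sigma)               # columns of U are principal eigenvectors
        x_rot = U[:, :k].T @ X                       # Eq. 4.1, keeping the top-k components
        return x_rot / np.sqrt(S[:k, None] + eps)    # Eq. 4.2: rescale by 1/sqrt(lambda_i)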

4.3 Independent Subspace Analysis

Figure 4.1: ISA Network

Independent subspace analysis (ISA) [28] is an unsupervised learning algorithm that learns features from unlabelled image patches. An ISA network (Fig. 4.1) is a two-layered network with square and square-root nonlinearities in the first and second layers respectively. The weights W of the first layer are learned, while the weights V of the second layer are fixed. Each second-layer hidden unit pools over a small neighbourhood of adjacent first-layer units. The first- and second-layer units are called simple and pooling units respectively. Given an input pattern xt, the activation of each pooling unit is given by:

pi(xt; W, V) = √( Σ from k=1 to m of Vik ( Σ from j=1 to n of Wkj xt,j )^2 )   (4.3)

The parameters W are learnt by finding sparse feature representations in the second layer, i.e. by minimizing:

Σ from t=1 to T of Σ from i=1 to m of pi(xt; W, V)   (4.4)

subject to:

W W^T = I   (4.5)

where:
W ∈ R^(k×n)
V ∈ R^(m×k)
{xt}, t = 1..T: whitened input examples
n: input dimension
k: number of simple units
m: number of pooling units

The ISA algorithm is able to learn Gabor filters (edge detectors) with many frequencies and orientations. Further, it is able to group similar features together, thereby achieving invariance. However, the standard ISA training algorithm becomes less efficient when input patches are large; training it on high-dimensional data, especially video data, takes days to complete. In order to scale the algorithm up to large inputs, we implemented a convolutional neural network architecture that progressively makes use of PCA and ISA as sub-units for unsupervised learning. We first train the ISA algorithm on small input patches. We then take this learned network and convolve it with a larger region of the input image. The combined responses of the convolution step are given as input to the next layer, which is also implemented by another ISA algorithm with PCA as a preprocessing step. As in the first layer, we use PCA to whiten the data and reduce its dimensionality, so that the next ISA layer only works with low-dimensional inputs, as shown in Fig. 4.2.

Figure 4.2: Stacked ISA
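For illustration, the forward pass of a single ISA layer (Eq. 4.3) can be sketched as below, assuming W (k × n) has been learned, V (m × k) is the fixed pooling matrix and x is a whitened input patch of dimension n; the function name is illustrative.

    import numpy as np

    def isa_pooling_activations(W, V, x):
        simple = (W @ x) ** 2            # first layer: squared responses of the simple units
        return np.sqrt(V @ simple)       # second layer: square root of the pooled responses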

4.4 Stacked ISA for Video Domain

The stacked ISA network can be extended to videos by giving the network 3-D video blocks as inputs instead of image patches. More specifically, a sequence of image patches is flattened into a vector and given to the network, as shown in Fig. 4.3. Finally, the features from both layers are used as local features for classification.

Figure 4.3: Stacked ISA for video data

4.5 Experimental Setup

In this section we discuss the Hollywood2 dataset used for the evaluation protocol. The features obtained from the dataset using unsupervised feature learning are utilized for classification of an unlabelled input video. We evaluate the features in a bag-of-features based action classification task and use a Support Vector Machine for classification.

4.5.1 Dataset

The system was trained on the Hollywood2 human actions dataset. The dataset consists of 12 classes of human activities distributed over 2517 video clips and approximately 15.6 hours of video in total, composed of clips from 69 movies. The training set has two subsets, clean and automatic. The clean training subset has action labels which were manually verified to be correct. The automatic training subset has noisy action labels which were collected by means of automatic script-to-video alignment in combination with text-based script classification. The dataset has a total of 823 clean training videos, 810 automatic training videos and 884 test videos. The human actions in the dataset are: answer phone, drive car, eat, fight person, get out car, handshake, hug person, kiss, run, sit down, sit up and stand up. Performance is evaluated as suggested in [18], i.e. by computing the average precision (AP) for each of the action classes and reporting the mean AP over all classes. The dataset is summarized in Table 4.1.

Table 4.1: Hollywood2 Action Dataset

                 Training Subset   Training Subset   Test Subset
                 (clean)           (automatic)       (clean)
Answer Phone          66                59               64
Drive Car             85                90              102
Eat                   40                44               33
Fight Person          54                33               70
Get Out Car           51                40               57
Handshake             32                38               45
Hug Person            64                27               66
Kiss                 114               125              103
Run                  135               187              141
Sit Down             104                87              108
Sit Up                24                26               37
Stand Up             132               133              146
All Samples          823               810              884

4.5.2 Framework for Classification

The features of the test videos are computed from the stacked Independent Subspace Analysis (ISA) model as explained in section 4.3. The video sequences are represented as a bag of local spatio-temporal features. In computer vision, the bag-of-words (BoW) model can be applied to image classification by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words, that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a sparse vector of occurrence counts of a vocabulary of local image features. The BoW model converts vector-represented patches to "codewords". A codeword can be considered a representative of several similar patches. One simple method is to perform k-means clustering over all the vectors; codewords are then defined as the centres of the learned clusters, and the number of clusters is the codebook size. Thus each patch in an image is mapped to a certain codeword through the clustering process, and the image can be represented by the histogram of codewords. To increase the precision, k-means is initialized three times. For classification we use a non-linear Support Vector Machine [29].

In machine learning, support vector machines (SVMs, also called support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. A support vector machine constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. In non-linear classification, the dot product used for linear classification is replaced by a non-linear kernel function. In the project we used the χ2 kernel [30]:

K(Hi, Hj) = exp( −(1/2A) Σ from n=1 to V of (hin − hjn)^2 / (hin + hjn) )   (4.6)

where Hi = {hin} and Hj = {hjn} are the frequency histograms of word occurrences, V is the vocabulary size and A is the mean value of the distances between all training samples. Multi-class SVM aims to assign labels to instances by using support vector machines, where the labels are drawn from a finite set of several elements. The dominant approach is to reduce the single multi-class problem into multiple binary classification problems: a binary classifier is used which distinguishes between one of the labels and the rest (one-versus-all). Classification of new instances in the one-versus-all scheme is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class.
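A minimal sketch of the χ2 kernel in Eq. 4.6 is given below, assuming H is an array of bag-of-words histograms with one row per video and that A is taken as the mean pairwise χ2 distance over the training samples, as stated above; the function name is illustrative.

    import numpy as np

    def chi2_kernel(H):
        n = H.shape[0]
        dist = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                num = (H[i] - H[j]) ** 2
                den = H[i] + H[j]
                dist[i, j] = np.sum(num[den > 0] / den[den > 0])  # chi-square distance
        A = dist[np.triu_indices(n, k=1)].mean()                  # mean pairwise distance
        return np.exp(-dist / (2 * A))                            # Eq. 4.6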

4.5.3 Results

The system was trained on the Hollywood2 dataset using the unsupervised feature learning technique and classified using Support Vector Machines. The mean average precision, recall, accuracy and F-score obtained for the system using non-overlapping dense sampling are summarized in Table 4.2. If dense sampling with 50% overlap is performed on a machine with 24 GB of RAM, the results improve by 8.6%.

Table 4.2: Performance of System

Mean AP    Mean Recall   Mean Accuracy   F-Score
44.776%    79.419%       92.232%         0.573

4.6 Summary

In this chapter, previous approaches used in activity recognition and their drawbacks were discussed. Recent advances made in the field of neural networks, such as stacked Independent Subspace Analysis for Unsupervised Feature-Learning, were presented. Stacked ISA was trained on the Hollywood2 Action Dataset, and the spatio-temporal features obtained were used to recognize 12 categories of human activities. The results obtained show that Unsupervised Feature-Learning outperforms many state-of-the-art methods.

Chapter 5

Snapshots

Snapshots of the Graphical User Interface demonstrating recognition of bending and side galloping activities using the MHI methodology, and a snapshot of the system's performance result when trained on the Hollywood2 Action Dataset using the Unsupervised Feature-Learning technique with SVM for classification, are given below.

Figure 5.1: Graphical User Interface (panels: Start Screen, Available Tools)

35

Page 50: Project Report Human Action Recognition

Figure 5.2: Bending Activity

Motion History Image Motion Energy Image

Figure 5.3: MHI and MEI for Bending Activity

36

Page 51: Project Report Human Action Recognition

Figure 5.4: Side Galloping Activity

Motion History Image Motion Energy Image

Figure 5.5: MHI and MEI for Side Galloping Activity

Figure 5.6: Training Completion

37

Page 52: Project Report Human Action Recognition

New input Classification Result

Figure 5.7: Test Bending Activity

New input Classification Result

Figure 5.8: Test Side Galloping Activity

Figure 5.9: Actions in Hollywood2 Dataset

38

Page 53: Project Report Human Action Recognition

Figure 5.10: Compute Features

Figure 5.11: Classification Results


Chapter 6

Conclusion and Future Work

In this report, various approaches to human activity recognition were discussed, and two different approaches were implemented for recognizing activities performed in a given video sequence. The project was divided into two phases. The first phase focussed on temporal template matching techniques for activity recognition. The Motion History Image (MHI) methodology was examined for non-repetitive activities, and features were extracted from the generated MHI and MEI images using the 7 Hu moments. These features were then used to train the system to classify the side-galloping and bending activities with the k-NN algorithm. Thereafter the drawbacks of the MHI methodology were presented, and it was shown that the MHI method fails to recognize repetitive and over-writing activities. To address this issue, a novel methodology was proposed and implemented and was shown to be successful when over-writing and repetitive actions are encountered. The algorithm of the proposed methodology and simulated results were also provided for the repetitive hand-waving action.
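As a concrete illustration of the first-phase pipeline summarized above, the sketch below extracts the seven Hu moments from an MHI or MEI template and classifies it with k-NN. It assumes OpenCV and scikit-learn; the choice of k, the log-scaling of the moments, and the function names are illustrative assumptions rather than the project's actual implementation.

import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def hu_features(template):
    # Seven Hu moment invariants of a grayscale MHI or MEI template.
    hu = cv2.HuMoments(cv2.moments(template)).flatten()
    # Log-scale so the seven invariants have comparable magnitudes
    # (a common practice, not necessarily what was used in the project).
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def train_classifier(templates, labels, k=3):
    # templates: list of MHI/MEI images; labels: e.g. "bending", "side_gallop".
    X = np.array([hu_features(t) for t in templates])
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, labels)
    return clf

def classify(clf, template):
    return clf.predict(hu_features(template).reshape(1, -1))[0]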

The second phase of the project focussed on unsupervised feature-learning techniques, a recent development in the field of machine learning. Various previous works using hand-designed features for activity recognition and their drawbacks were discussed. The unsupervised feature-learning approach, based on stacked Independent Subspace Analysis (ISA) and PCA whitening, was implemented for spatio-temporal feature extraction from the Hollywood2 Action Dataset. The features learnt were used to classify an unlabelled test video into its class of activity using Support Vector Machines. The system was trained and tested on 12 different human activities.

As part of future work we would like to extend the MHI methodology with unsupervised feature-extraction techniques. In other words, after generating the Motion History and Motion Energy Images, we will use stacked Independent Subspace Analysis to learn features instead of using the 7 Hu moments at the feature-extraction stage, thus replacing the hand-coded features with a more efficient method. Also, the k-NN algorithm used for classification in the MHI method is naive and does not scale well to large training data; we will therefore replace it with a more efficient classification technique such as the Support Vector Machine.


References

[1] L. G. Shapiro and G. C. Stockman, Computer Vision. Prentice Hall,

2001.

[2] Wikipedia. Activity recognition. [Online]. Available: http://en.wikipedia.org/wiki/Activity_recognition

[3] E. Kim, S. Helal, and D. Cook, “Human activity recognition and pattern

discovery,” Pervasive Computing, IEEE, vol. 9, no. 1, pp. 48–53, Jan.

2010.

[4] J. Aggarwal and M. Ryoo, “Human activity analysis: A review,” ACM Computing Surveys (CSUR), vol. 43, pp. 16:1–16:43, Apr. 2011.

[5] A. Yilmaz and M. Shah, “Recognizing human actions in videos acquired

by uncalibrated moving cameras,” in Proceedings of the Tenth IEEE In-

ternational Conference on Computer Vision (ICCV’05), vol. 1. IEEE

Computer Society, 2005, pp. 150–157.

[6] E. Shechtman and M. Irani, “Space-time behavior based correlation,” in

IEEE Conference on Computer Vision & Pattern Recognition (CVPR),

vol. 1. IEEE Computer Society, June 2005, pp. 405–412.

[7] S. M. Khan and M. Shah, “Detecting group activities using rigidity of for-

mation,” in Proceedings of the 13th annual ACM international conference

on Multimedia, 2005, pp. 403–406.


[8] N. Nguyen, D. Phung, S. Venkatesh, and H. Bui, “Learning and detect-

ing activities from movement trajectories using the hierarchical hidden

markov model,” in IEEE Computer Society Conference on Computer Vi-

sion & Pattern Recognition CVPR, vol. 2, Jan. 2005, pp. 955–960.

[9] S. W. Joo and R. Chellappa, “Attribute grammar-based event recogni-

tion and anomaly detection,” in IEEE Computer Society Conference on

Computer Vision & Pattern Recognition Workshop CVPRW, Jan. 2006,

pp. 107–112.

[10] A. Gupta, P. Srinivasan, J. Shi, and L. Davis, “Understanding videos,

constructing plots learning a visually grounded storyline model from an-

notated videos,” in IEEE Computer Society Conference on Computer Vi-

sion & Pattern Recognition CVPR, Jan. 2009, pp. 2012–2019.

[11] Y. Ke, R. Sukthankar, and M. Hebert, “Spatio-temporal shape and flow

correlation for action recognition,” in IEEE Computer Society Conference

on Computer Vision & Pattern Recognition (CVPR), June 2007, pp. 1–8.

[12] I. Laptev, “On space-time interest points,” International Journal of Com-

puter Vision, vol. 64, pp. 107–123, Sep. 2005.

[13] A. F. Bobick and J. W. Davis, “The recognition of human movement

using temporal templates,” IEEE Transaction on Pattern Analysis and

Machine Intelligence, vol. 23, pp. 257–267, Mar. 2001.

[14] M. A. R. Ahad, J. K. Tan, H. Kim, and S. Ishikawa, “Motion history im-

age: its variants and applications,” Machine Vision Application, vol. 23,

pp. 255–281, Mar. 2012.

[15] D. G. Lowe, “Object recognition from local scale-invariant features,” in

The Proceedings of the Seventh IEEE International Conference on Com-

puter Vision ICCV, June 1999, pp. 1150–1157.


[16] N. Dalal and B. Triggs, “Histograms of oriented gradients for human

detection,” in IEEE Computer Society Conference on Computer Vision

& Pattern Recognition CVPR, July 2005, pp. 886–893.

[17] Q. Le, W. Zou, S. Yeung, and A. Ng, “Learning hierarchical invariant

spatio-temporal features for action recognition with independent subspace

analysis,” in IEEE Computer Society Conference on Computer Vision &

Pattern Recognition CVPR, June 2011, pp. 3361–3368.

[18] M. Marszałek, I. Laptev, and C. Schmid, “Actions in context,” in IEEE Conference on Computer Vision & Pattern Recognition CVPR, June 2009, pp. 2929–2936.

[19] M. Rodriguez, J. Ahmed, and M. Shah, “Action mach a spatio-temporal

maximum average correlation height filter for action recognition,” in IEEE

Computer Society Conference on Computer Vision & Pattern Recognition

CVPR, May 2008, pp. 1–8.

[20] C. Schuldt, I. Laptev, and B. Caputo, “Recognizing human actions: a local svm approach,” in Proceedings of the 17th International Conference on Pattern Recognition ICPR, June 2004, vol. 3, pp. 32–36.

[21] J. Liu, J. Luo, and M. Shah, “Recognizing realistic actions from videos

in the wild,” in IEEE Computer Society Conference on Computer Vision

& Pattern Recognition CVPR, June 2009, pp. 1996–2003.

[22] M.-K. Hu, “Visual pattern recognition by moment invariants,” IRE

Transactions on Information Theory, vol. 8, pp. 179–187, Mar. 1962.

[23] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, “Actions

as space-time shapes,” in The Tenth IEEE International Conference on

Computer Vision (ICCV’05), 2005, pp. 1395–1402.


[24] Wikipedia. K-nearest neighbors algorithm. [Online]. Available: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

[25] S. Ji, W. Xu, M. Yang, and K. Yu, “3d convolutional neural networks

for human action recognition,” IEEE Transactions on Pattern Analysis

& Machine Intelligence, vol. 35, pp. 221–231, Jan. 2013.

[26] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, C. Schmid et al., “Evaluation

of local spatio-temporal features for action recognition,” in BMVC 2009-

British Machine Vision Conference, Sep. 2009, pp. 545–554.

[27] Wikipedia. Principal component analysis. [Online]. Available: https://en.wikipedia.org/wiki/Principal_component_analysis

[28] A. Hyvarinen, J. Hurri, and P. O. Hoyer, Natural image statistics.

Springer, 2009.

[29] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, pp. 27:1–27:27, May 2011.

[30] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic

human actions from movies,” in IEEE Conference on Computer Vision

& Pattern Recognition CVPR, June 2008, pp. 1–8.


Project Detail

Student Details
Student Name: Arpit Jain
Registration Number: 090911345
Section/Roll No.: A/18
Email Address: [email protected]
Phone No. (M): 9901536320

Student Details
Student Name: Satakshi Rana
Registration Number: 090911281
Section/Roll No.: A/14
Email Address: [email protected]
Phone No. (M): 9590329141

Project Details
Project Title: Human Activity Recognition using Temporal Templates and Unsupervised Feature-Learning Techniques
Project Duration: 5 Months
Date of Reporting: 07-01-2013

Organization Details
Organization Name: Manipal Institute of Technology
Full Postal Address: MIT, Manipal
Website Address: www.manipal.edu

Internal Guide Details
Faculty Name: Dr. Sanjay Singh
Full Contact Address with PIN Code: Dept of I&CT, MIT, Manipal-576104
Email Address: [email protected]
