What is an Event?
Dictionary.com definition:
“something that occurs in a certain place during a particular interval of time.”
Examples from various domains:
• Sports: a shot on goal
• Surveillance: entering a car
• Movies: drinking
Importance of Human Actions
• Most videos recorded and downloadable from the web contain people; their semantics are therefore defined by people's behavior.
• Third-generation video-surveillance systems benefit from automatic interpretation of human actions and behaviors.
Definition 1: a physical body motion.
Definition 2: an interaction with the environment (objects or people) for a specific purpose.
Human action recognition challenges
• Actor appearance variation: gender, clothing, body posture and size.
• Scale, illumination and background changes, as in object categorization.
• Semantically different but perceptually similar actions (e.g. running and jogging).
• Different ways of executing the same action, resulting in changes of limb trajectory and speed.
Are actions space-time objects?
We already know how to detect instances of object categories in static images.
How do we take advantage of time to describe dynamic concepts (i.e. human actions)?
[Pipeline diagram: interest point extraction → bag-of-features → visual dictionary → SVM classifier, over the six classes: running, walking, jogging, handwaving, handclapping, boxing]
Framework Overview:
• Same three steps as in object categorization (feature extraction, dictionary formation, classification).
• The feature detector and descriptor, however, differ!
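As a sketch of this three-step pipeline (with interest point extraction stubbed out, since the actual detector and descriptor are covered in the next slides), the stages could be wired together as follows; all names, dimensions and parameters here are illustrative, not the authors' implementation:

```python
# Minimal bag-of-features action classification sketch.
# Feature extraction is a random stub; dimensions are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def extract_descriptors(video):
    """Stand-in for interest point detection + description."""
    n_points = int(rng.integers(20, 40))
    return rng.normal(size=(n_points, 64))  # 64-D descriptors

# 1) Feature extraction over the training set, then dictionary formation.
train_videos = [None] * 10
all_desc = np.vstack([extract_descriptors(v) for v in train_videos])
dictionary = KMeans(n_clusters=8, n_init=10, random_state=0).fit(all_desc)

# 2) Represent each video as a normalized histogram of visual words.
def bow_histogram(video):
    words = dictionary.predict(extract_descriptors(video))
    hist = np.bincount(words, minlength=8).astype(float)
    return hist / hist.sum()

X = np.array([bow_histogram(v) for v in train_videos])
y = np.array([0, 1] * 5)  # e.g. "walking" vs "boxing"

# 3) Classification: an SVM trained on the histograms.
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.predict(X[:2]))
```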
boxing
hand waving
Space-time Interest Point Detectors
• Consider the video as a volume of pixels built from subsequent frames.
• Automatically find a local neighborhood “where the action happens”.
• Advantages:
  • robustness w.r.t. partial body occlusions;
  • almost insensitive to background clutter;
  • no tracking or prior segmentation required.
• Problems:
  • how do we define an “interesting location”?
  • how do we perform scale selection (in space and time!)?
What neighborhoods to consider?
• A solution: extend the Harris-corner operator to time.
Distinctive neighborhoods exhibit high image variation in space and time.
Look at the distribution of the space-time gradient: its covariance, computed at a spatial scale σ and a temporal scale τ.
Space-time Interest Point Detectors
• Represent the video as a function $f(x,y,t)$.
• Compute Gaussian derivatives of the scale-space representation $L = g(\cdot;\Sigma) * f$, with anisotropic Gaussian kernel $g$ of covariance $\Sigma = \mathrm{diag}(\sigma^2, \sigma^2, \tau^2)$.
• The second-moment matrix describes the distribution of the space-time gradient $\nabla L = (L_x, L_y, L_t)^\top$ within a local neighborhood:
$$\mu = g(\cdot;\Sigma') * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix}$$
• Interest points are the local maxima of $H = \det(\mu) - k\,\mathrm{trace}^3(\mu)$ over $(x,y,t)$ (similar to the Harris operator [Harris and Stephens, 1988]).
• A high variation of $f$ in all directions corresponds to large eigenvalues of $\mu$.
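A minimal sketch of this space-time operator, assuming a simple setting where the smoothing scales and the constant k are illustrative choices, not Laptev's exact settings:

```python
# Space-time Harris sketch: build the 3x3 second-moment matrix from
# Gaussian-smoothed gradients of f(x,y,t) and score each voxel with
# H = det(mu) - k * trace(mu)**3. Scales and k are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def space_time_harris(f, sigma=1.5, tau=1.5, k=0.005):
    # Gaussian smoothing of the video volume (axes: t, y, x),
    # then gradients by finite differences.
    L = gaussian_filter(f, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    grads = [Lx, Ly, Lt]
    # Second-moment matrix: smoothed outer products of the gradient,
    # integrated at a larger scale.
    s = (2 * tau, 2 * sigma, 2 * sigma)
    mu = np.empty(f.shape + (3, 3))
    for i in range(3):
        for j in range(3):
            mu[..., i, j] = gaussian_filter(grads[i] * grads[j], sigma=s)
    det = np.linalg.det(mu)
    tr = np.trace(mu, axis1=-2, axis2=-1)
    return det - k * tr ** 3

video = np.random.default_rng(1).normal(size=(16, 32, 32))
H = space_time_harris(video)
print(H.shape)  # -> (16, 32, 32), same shape as the input volume
```

Interest points would then be taken as local maxima of H over the volume.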
Space-time Interest Point Detectors
Spatio-temporal corners
• Spatial corners: the second-moment descriptor can be thought of as the covariance matrix of a two-dimensional distribution of image orientation. Large eigenvalues indicate a strong gradient (variation) in both image directions.
• Spatio-temporal corners are located in regions that exhibit a high variation in all three directions.
• Example: a spatial corner inverting its motion (high temporal gradient variation) is a spatio-temporal corner.
Periodic local motion detector
$$R = (I * g * h_{ev})^2 + (I * g * h_{od})^2$$
where $g(x,y;\sigma) = e^{-(x^2+y^2)/2\sigma^2}$ is a 2D spatial Gaussian and $h_{ev}$, $h_{od}$ are a quadrature pair of 1D temporal Gabor filters:
$$h_{ev}(t;\tau,\omega) = -\cos(2\pi t\omega)\,e^{-t^2/\tau^2}, \qquad h_{od}(t;\tau,\omega) = -\sin(2\pi t\omega)\,e^{-t^2/\tau^2}, \qquad \omega = 4/\tau$$
with multiple scales $\sigma = 2, 4, 8, \ldots$ and $\tau = 2, 4, 8, \ldots$
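A rough implementation of this response function might look as follows; the filter support and the default scale values are illustrative assumptions:

```python
# Periodic-motion response sketch: per-frame spatial Gaussian
# smoothing, then a quadrature pair of 1-D temporal Gabor filters
# with omega = 4 / tau; the response is the sum of squared outputs.
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter

def gabor_pair(tau):
    # Filter support of +-2*tau frames (an illustrative choice).
    t = np.arange(-2 * tau, 2 * tau + 1)
    omega = 4.0 / tau
    env = np.exp(-t ** 2 / tau ** 2)
    h_ev = -np.cos(2 * np.pi * t * omega) * env
    h_od = -np.sin(2 * np.pi * t * omega) * env
    return h_ev, h_od

def periodic_response(video, sigma=2.0, tau=2.0):
    # video axes: (t, y, x); the Gaussian acts only on spatial axes.
    I = gaussian_filter(video, sigma=(0, sigma, sigma))
    h_ev, h_od = gabor_pair(tau)
    r_ev = convolve1d(I, h_ev, axis=0)
    r_od = convolve1d(I, h_od, axis=0)
    return r_ev ** 2 + r_od ** 2

video = np.random.default_rng(2).normal(size=(20, 24, 24))
R = periodic_response(video)
print(R.shape)  # -> (20, 24, 24)
```

Running the detector at several (σ, τ) pairs gives the multi-scale extension described below.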
Idea: time and space have to be treated differently
• The problem with Laptev’s detector is the excessive sparseness of the interest points.
• Performing automatic scale selection makes this issue even worse!
• Dollár proposed to use a Gabor filter in time and a Gaussian filter in space.
• The Gaussian filter performs the spatial scale (σ) selection of the event by smoothing each frame.
• The Gabor filter is a bandpass filter which gives high responses to periodic variations of the signal. The period, or temporal scale, depends on τ.
• We extend this operator by using multiple scales in space and time.
• This interest point operator provides an extremely dense information sampling in interesting locations.
• The spatial scale refers to the size of the moving object:
  • we can detect the same event observed at different distances;
  • we are able to select events of different spatial sizes (e.g. head, legs).
• The temporal scale refers to the speed at which the object moves:
  • we are able to detect the same event performed at a different speed;
  • we are able to detect the proper scale for different events.
The importance of multiple scales
Consider a walking person: at a certain scale only the torso motion is detected, although the leg and arm movements are undoubtedly more informative.
[Figure: detector response at large and small scales; red denotes a high detector response at a given space and time.]
• A volume of pixels is extracted around each local maximum of the detector response R(x,y,t).
• We need to provide a suitable description of the appearance and motion of each interest point neighborhood.
• Two image measurements are exploited:
  • optic flow (motion oriented);
  • 3D gradient (appearance oriented).
• As in the SIFT descriptor, position-dependent histograms are used.
Descriptors for spatio-temporal patches
Optical flow:
$$M_{\mathrm{OF}} = \left( \sqrt{V_x^2 + V_y^2},\ \tan^{-1}(V_y / V_x) \right)$$
3D gradient:
$$M_{\mathrm{3DG}} = \left( \sqrt{G_x^2 + G_y^2 + G_t^2},\ \tan^{-1}(G_y / G_x),\ \tan^{-1}\!\left(G_t \,/\, \sqrt{G_x^2 + G_y^2}\right) \right)$$
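The two patch measurements could be computed, in sketch form, as below; the patch is random and the fields come from plain finite differences rather than a real optic-flow estimator:

```python
# Sketch of the two patch measurements: flow magnitude/orientation,
# and 3-D gradient magnitude plus spatial and temporal orientations.
import numpy as np

def flow_descriptor(Vx, Vy):
    mag = np.sqrt(Vx ** 2 + Vy ** 2)
    ori = np.arctan2(Vy, Vx)          # tan^-1(Vy / Vx)
    return mag, ori

def gradient3d_descriptor(patch):
    Gt, Gy, Gx = np.gradient(patch)   # patch axes: (t, y, x)
    mag = np.sqrt(Gx ** 2 + Gy ** 2 + Gt ** 2)
    phi = np.arctan2(Gy, Gx)                          # spatial orientation
    theta = np.arctan2(Gt, np.sqrt(Gx**2 + Gy**2))    # temporal orientation
    return mag, phi, theta

patch = np.random.default_rng(3).normal(size=(8, 16, 16))
mag, phi, theta = gradient3d_descriptor(patch)
# Position-dependent histograms (as in SIFT) would then quantize the
# orientations, weighted by magnitude, over a grid of cells.
hist, _ = np.histogram(phi, bins=8, range=(-np.pi, np.pi), weights=mag)
print(hist.shape)  # -> (8,)
```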
Descriptors combination
• Descriptors perform differently on each action class.
• To exploit their complementarity, two combinations are proposed:
  1. Descriptor concatenation.
  2. Bag-of-Words concatenation.
• Codebook combination obtains the best performance.

Descriptor      Accuracy
3D Grad         93%
HoF             95%
3DGrad + HoF    96%

Descriptor      Accuracy
3D Grad         90%
HoF             84%
3DGrad + HoF    91%
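The difference between the two strategies can be sketched as follows; the descriptor dimensions and codebook sizes are illustrative, and the codebook assignment is stubbed with random words:

```python
# Strategy 1: concatenate raw descriptors, then one joint codebook.
# Strategy 2: one bag-of-words per descriptor type, then concatenate
# the histograms. Dimensions below are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n_points = 30
grad3d = rng.normal(size=(n_points, 96))   # 3D-gradient descriptors
hof = rng.normal(size=(n_points, 72))      # histogram-of-flow descriptors

# Strategy 1: descriptor concatenation -> single joint dictionary.
joint = np.hstack([grad3d, hof])           # (30, 168)

# Strategy 2: BoW concatenation -> one codebook per descriptor type.
def fake_bow(desc, k):
    words = rng.integers(0, k, size=len(desc))  # stand-in for assignment
    h = np.bincount(words, minlength=k).astype(float)
    return h / h.sum()

combined_bow = np.concatenate([fake_bow(grad3d, 50), fake_bow(hof, 50)])
print(joint.shape, combined_bow.shape)  # -> (30, 168) (100,)
```

Strategy 2 lets each descriptor type keep its own dictionary, which is the codebook combination that performs best above.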
Descriptor combination strategy
[Diagram: each ST patch is described by 3DGrad and HoF. Strategy 1 concatenates the descriptors into 3DGrad_HoF, quantized with a single visual dictionary into one BoW action representation. Strategy 2 quantizes each descriptor with its own visual dictionary and concatenates the resulting BoW action representations (3DGrad + HoF BoW).]
Effective codebooks:
• Spatio-temporal descriptors span an extremely high-dimensional feature space.
• Our dense multi-scale sampling produces a non-uniform feature space.
• K-means clusters are attracted towards densely populated regions; less dense zones are not represented correctly.
• Radius-based clustering [Jurie ICCV05] exploits mode finding to place cluster centers, giving a more accurate coding of the feature space.
Note: to reduce the uncertainty we perform soft assignment.
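Soft assignment could be sketched as follows, weighting all codebook centers with a Gaussian kernel on the descriptor-to-center distance instead of a hard nearest-neighbor vote (the kernel and its bandwidth are illustrative assumptions):

```python
# Soft assignment sketch: each descriptor distributes a unit vote
# over all cluster centers according to a Gaussian of the distance.
import numpy as np

def soft_assign(descriptors, centers, sigma=1.0):
    # Pairwise squared distances between descriptors and centers.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)   # each descriptor sums to 1
    return w.sum(axis=0)                # accumulate into a histogram

rng = np.random.default_rng(5)
desc = rng.normal(size=(40, 16))
centers = rng.normal(size=(10, 16))
hist = soft_assign(desc, centers)
print(hist.shape)  # -> (10,); hist sums to the number of descriptors
```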
Results: codebook performance
[Plots for KTH and Weizmann: accuracy vs. codebook size. Words are sorted by frequency and added incrementally to the dictionary; mid-frequency terms are informative, high-frequency terms are non-informative.]
Results: dataset
The approach is tested on two standard datasets.
• KTH
  • 25 actors
  • 6 actions
  • 4 viewing conditions
  • 2931 clips
• Weizmann
  • 9 actors
  • 10 actions
  • 1 viewing condition
  • 93 clips
The Weizmann dataset is considered less challenging due to the reduced variability of shooting conditions and the smaller number of actors.
Results: comparison with the state of the art
We compare our results, measured with the same methodology, against the current state of the art.

Method                          KTH     Weizmann
Our method                      92.57   95.41
Laptev et al. - HoG ['08]       81.6    -
Laptev et al. - HoF ['08]       89.7    -
Dollár et al. ['05]             81.2    -
Wong and Cipolla ['07]          86.6    -
Scovanner et al. ['07]          -       82.6
Niebles et al. ['08]            83.3    90
Liu et al. ['08]                -       90.4
Kläser et al. ['08]             91.4    84.3
Willems et al. ['08]            84.2    -
Real video footage
We test our detector on a sequence taken in a garage (walking and running actions).
A sliding temporal window is used to perform the segmentation.
Recognizing generic video events
• Online video search and video indexing.
• Events are characterized by an evolution of scenes, objects and actions over time.
• 56 events are defined in LSCOM.
• Event examples in the news domain: Airplane Flying, Car Exiting.
Event Recognition: Object Tracking
• A possible approach, which exploits object recognition, is to detect objects of interest, track them over time, and model their spatio-temporal dynamics (object detection & localization → tracking → inference, e.g. “Airplane Landing”?).
• Some events are well defined by the presence and motion of an object.
• It is hard to detect events without explicit object motion, such as Riot.
Event recognition: exploit dynamic concept evolution
[Pipeline: feature extraction → concept detectors → EMD distance]
• Global low-level features are extracted, such as edge histograms, Gabor texture descriptors and grid color moments.
• 108 concept detectors are trained on these features.
• Each frame is represented by 108 concept scores.
• Shot similarity is evaluated by computing the Earth Mover’s Distance (EMD).
• The EMD is plugged into an RBF kernel and used in an SVM to predict the category.
Content Representation: Mid-level Semantic Concept Scores
• Train detectors on low-level features over an image database.
• A mid-level semantic concept feature is more robust.
• Columbia developed and released 374 semantic concept detectors, available online at http://www.ee.columbia.edu/ln/dvmm/columbia374/
Earth Mover’s Distance (EMD): Approach
• Supplier P has a given amount of goods; receiver Q has a given limited capacity; d_ij is the ground distance between element i of P and element j of Q.
• The optimal flow (weights) is solved by linear programming.
• The EMD handles:
  • temporal shift: a frame at the beginning of P can be mapped to a frame at the end of Q;
  • scale variations: a frame from P can be mapped to multiple frames in Q.
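The transportation problem above can be sketched directly with a linear-programming solver; the toy signatures below are illustrative:

```python
# EMD sketch as a transportation LP: minimize sum f_ij * d_ij with
# row sums <= p_i, column sums <= q_j, total flow = min(sum p, sum q).
import numpy as np
from scipy.optimize import linprog

def emd(p, q, D):
    m, n = len(p), len(q)
    c = D.ravel()
    A_ub, b_ub = [], []
    # Row constraints: supplier i ships at most p_i.
    for i in range(m):
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
        A_ub.append(row); b_ub.append(p[i])
    # Column constraints: receiver j takes at most q_j.
    for j in range(n):
        col = np.zeros(m * n); col[j::n] = 1
        A_ub.append(col); b_ub.append(q[j])
    # Total flow must equal min(sum p, sum q).
    A_eq = [np.ones(m * n)]; b_eq = [min(p.sum(), q.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
    return res.fun / b_eq[0]  # cost normalized by total flow

p = np.array([0.4, 0.6]); q = np.array([0.5, 0.5])
D = np.array([[0.0, 1.0], [1.0, 0.0]])
print(round(emd(p, q, D), 3))  # -> 0.1
```

Because rows and columns only need to respect capacities (not match one-to-one), the distance naturally tolerates the temporal shifts and scale variations listed above.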
Experiments: keyframe-based feature performance
[Bar chart: average precision (0.0–1.0) per event class — Car Crash, Protest, Greeting, Car Exiting, Combat, Marching, Riot, Running, Shooting, Walking, and the average — for four features: concept scores, Gabor texture, edge direction histogram, color moments.]
Dataset: TRECVID2005. Evaluation metric: Average Precision.
References
• On space-time interest points. Laptev, I. IJCV 2005.
• Behavior recognition via sparse spatio-temporal features. Dollár, P., Rabaud, V., Cottrell, G. and Belongie, S. ICCV VS-PETS 2005.
• Effective Codebooks for Human Action Recognition. Ballan, L., Bertini, M., Del Bimbo, A., Seidenari, L. and Serra, G. ICCV VOEC 2009.
• Video Event Recognition using kernel methods with multilevel temporal alignment. Dong Xu, Shih-Fu Chang. TPAMI 2008.