What is an Event?
Dictionary.com definition:
“something that occurs in a certain place during a particular interval of time.”
Examples from various domains:
• Sports: a shot on goal
• Surveillance: entering a car
• Movies: drinking
Importance of Human Actions
• Most videos recorded and downloadable from the web contain people; their semantics are therefore defined by people's behavior.
• Third-generation video-surveillance systems benefit from automatic interpretation of human actions and behaviors.
Definition 1: a physical body motion.
Definition 2: an interaction with the environment (objects or people) for a specific purpose.
Human action recognition challenges
• Actor appearance variation: gender, clothing, body posture and size.
• Scale, illumination and background changes, as in object categorization.
• Semantically different but perceptually similar actions (e.g. running and jogging).
• Different ways of executing the same action, resulting in changes of limb trajectory and speed.
Are actions space-time objects?
We already know how to detect instances of object categories in static images.
How do we take advantage of time to describe dynamic concepts (i.e. human actions)?
[Pipeline diagram: interest point extraction → bag-of-features → visual dictionary → SVM classifier, over the six classes: running, walking, jogging, handwaving, handclapping, boxing]
Framework Overview:
• Same three steps as in object categorization (feature extraction, dictionary formation, classification).
• The feature detector and descriptor, however, differ!
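As a sketch of this three-step pipeline (with interest point extraction stubbed out, since the actual detector and descriptor are covered in the next slides), the stages could be wired together as follows; all names, dimensions and parameters here are illustrative, not the authors' implementation:

```python
# Minimal bag-of-features action classification sketch.
# Feature extraction is a random stub; dimensions are illustrative.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def extract_descriptors(video):
    """Stand-in for interest point detection + description."""
    n_points = int(rng.integers(20, 40))
    return rng.normal(size=(n_points, 64))  # 64-D descriptors

# 1) Feature extraction over the training set, then dictionary formation.
train_videos = [None] * 10
all_desc = np.vstack([extract_descriptors(v) for v in train_videos])
dictionary = KMeans(n_clusters=8, n_init=10, random_state=0).fit(all_desc)

# 2) Represent each video as a normalized histogram of visual words.
def bow_histogram(video):
    words = dictionary.predict(extract_descriptors(video))
    hist = np.bincount(words, minlength=8).astype(float)
    return hist / hist.sum()

X = np.array([bow_histogram(v) for v in train_videos])
y = np.array([0, 1] * 5)  # e.g. "walking" vs "boxing"

# 3) Classification: an SVM trained on the histograms.
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.predict(X[:2]))
```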
boxing
hand waving
Space-time Interest Point Detectors
• Consider the video as a volume of pixels built from subsequent frames.
• Automatically find a local neighborhood “where the action happens”.
• Advantages:
  • robustness w.r.t. partial body occlusions;
  • almost insensitive to background clutter;
  • no tracking or prior segmentation required.
• Problems:
  • how do we define an “interesting location”?
  • how do we perform scale selection (in space and time!)?
What neighborhoods to consider?
• A solution: extend the Harris-corner operator to time.
Distinctive neighborhoods exhibit high image variation in space and time.
Look at the distribution of the space-time gradient: its covariance, computed at a spatial scale σ and a temporal scale τ.
Space-time Interest Point Detectors
• Represent the video as a function $f(x,y,t)$.
• Compute Gaussian derivatives of the scale-space representation $L = g(\cdot;\Sigma) * f$, with anisotropic Gaussian kernel $g$ of covariance $\Sigma = \mathrm{diag}(\sigma^2, \sigma^2, \tau^2)$.
• The second-moment matrix describes the distribution of the space-time gradient $\nabla L = (L_x, L_y, L_t)^\top$ within a local neighborhood:
$$\mu = g(\cdot;\Sigma') * \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix}$$
• Interest points are the local maxima of $H = \det(\mu) - k\,\mathrm{trace}^3(\mu)$ over $(x,y,t)$ (similar to the Harris operator [Harris and Stephens, 1988]).
• A high variation of $f$ in all directions corresponds to large eigenvalues of $\mu$.
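A minimal sketch of this space-time operator, assuming a simple setting where the smoothing scales and the constant k are illustrative choices, not Laptev's exact settings:

```python
# Space-time Harris sketch: build the 3x3 second-moment matrix from
# Gaussian-smoothed gradients of f(x,y,t) and score each voxel with
# H = det(mu) - k * trace(mu)**3. Scales and k are illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def space_time_harris(f, sigma=1.5, tau=1.5, k=0.005):
    # Gaussian smoothing of the video volume (axes: t, y, x),
    # then gradients by finite differences.
    L = gaussian_filter(f, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    grads = [Lx, Ly, Lt]
    # Second-moment matrix: smoothed outer products of the gradient,
    # integrated at a larger scale.
    s = (2 * tau, 2 * sigma, 2 * sigma)
    mu = np.empty(f.shape + (3, 3))
    for i in range(3):
        for j in range(3):
            mu[..., i, j] = gaussian_filter(grads[i] * grads[j], sigma=s)
    det = np.linalg.det(mu)
    tr = np.trace(mu, axis1=-2, axis2=-1)
    return det - k * tr ** 3

video = np.random.default_rng(1).normal(size=(16, 32, 32))
H = space_time_harris(video)
print(H.shape)  # -> (16, 32, 32), same shape as the input volume
```

Interest points would then be taken as local maxima of H over the volume.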
Space-time Interest Point Detectors
Spatio-temporal corners
• Spatial corners: the second-moment descriptor can be thought of as the covariance matrix of a two-dimensional distribution of image orientation. Large eigenvalues indicate a strong gradient (variation) in both image directions.
• Spatio-temporal corners are located in regions that exhibit a high variation in all three directions.
• Example: a spatial corner inverting its motion (high temporal gradient variation) is a spatio-temporal corner.
Periodic local motion detector
$$R = (I * g * h_{ev})^2 + (I * g * h_{od})^2$$
where $g(x,y;\sigma) = e^{-(x^2+y^2)/2\sigma^2}$ is a 2D spatial Gaussian and $h_{ev}$, $h_{od}$ are a quadrature pair of 1D temporal Gabor filters:
$$h_{ev}(t;\tau,\omega) = -\cos(2\pi t\omega)\,e^{-t^2/\tau^2}, \qquad h_{od}(t;\tau,\omega) = -\sin(2\pi t\omega)\,e^{-t^2/\tau^2}, \qquad \omega = 4/\tau$$
with multiple scales $\sigma = 2, 4, 8, \ldots$ and $\tau = 2, 4, 8, \ldots$
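A rough implementation of this response function might look as follows; the filter support and the default scale values are illustrative assumptions:

```python
# Periodic-motion response sketch: per-frame spatial Gaussian
# smoothing, then a quadrature pair of 1-D temporal Gabor filters
# with omega = 4 / tau; the response is the sum of squared outputs.
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter

def gabor_pair(tau):
    # Filter support of +-2*tau frames (an illustrative choice).
    t = np.arange(-2 * tau, 2 * tau + 1)
    omega = 4.0 / tau
    env = np.exp(-t ** 2 / tau ** 2)
    h_ev = -np.cos(2 * np.pi * t * omega) * env
    h_od = -np.sin(2 * np.pi * t * omega) * env
    return h_ev, h_od

def periodic_response(video, sigma=2.0, tau=2.0):
    # video axes: (t, y, x); the Gaussian acts only on spatial axes.
    I = gaussian_filter(video, sigma=(0, sigma, sigma))
    h_ev, h_od = gabor_pair(tau)
    r_ev = convolve1d(I, h_ev, axis=0)
    r_od = convolve1d(I, h_od, axis=0)
    return r_ev ** 2 + r_od ** 2

video = np.random.default_rng(2).normal(size=(20, 24, 24))
R = periodic_response(video)
print(R.shape)  # -> (20, 24, 24)
```

Running the detector at several (σ, τ) pairs gives the multi-scale extension described below.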
Idea: time and space have to be treated differently
• The problem with Laptev’s detector is the excessive sparseness of the interest points.
• Performing automatic scale selection makes this issue even worse!
• Dollár proposed to use a Gabor filter in time and a Gaussian filter in space.
• The Gaussian filter performs the spatial scale (σ) selection of the event by smoothing each frame.
• The Gabor filter is a bandpass filter which gives high responses to periodic variations of the signal. The period, or temporal scale, depends on τ.
• We extend this operator by using multiple scales in space and time.
• This interest point operator provides an extremely dense information sampling in interesting locations.
• The spatial scale refers to the size of the moving object:
  • we can detect the same event observed at different distances;
  • we are able to select events of different spatial sizes (e.g. head, legs).
• The temporal scale refers to the speed at which the object moves:
  • we are able to detect the same event performed at a different speed;
  • we are able to detect the proper scale for different events.
The importance of multiple scales
Consider a walking person: at a certain scale only the torso motion is detected, although the leg and arm movements are undoubtedly more informative.
[Figure: detector response at large and small scales; red denotes a high detector response at a given space and time.]
• A volume of pixels is extracted around each local maximum of the detector response R(x,y,t).
• We need to provide a suitable description of the appearance and motion of each interest point neighborhood.
• Two image measurements are exploited:
  • optic flow (motion oriented);
  • 3D gradient (appearance oriented).
• As in the SIFT descriptor, position-dependent histograms are used.
Descriptors for spatio-temporal patches
Optical flow:
$$M_{\mathrm{OF}} = \left( \sqrt{V_x^2 + V_y^2},\ \tan^{-1}(V_y / V_x) \right)$$
3D gradient:
$$M_{\mathrm{3DG}} = \left( \sqrt{G_x^2 + G_y^2 + G_t^2},\ \tan^{-1}(G_y / G_x),\ \tan^{-1}\!\left(G_t \,/\, \sqrt{G_x^2 + G_y^2}\right) \right)$$
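The two patch measurements could be computed, in sketch form, as below; the patch is random and the fields come from plain finite differences rather than a real optic-flow estimator:

```python
# Sketch of the two patch measurements: flow magnitude/orientation,
# and 3-D gradient magnitude plus spatial and temporal orientations.
import numpy as np

def flow_descriptor(Vx, Vy):
    mag = np.sqrt(Vx ** 2 + Vy ** 2)
    ori = np.arctan2(Vy, Vx)          # tan^-1(Vy / Vx)
    return mag, ori

def gradient3d_descriptor(patch):
    Gt, Gy, Gx = np.gradient(patch)   # patch axes: (t, y, x)
    mag = np.sqrt(Gx ** 2 + Gy ** 2 + Gt ** 2)
    phi = np.arctan2(Gy, Gx)                          # spatial orientation
    theta = np.arctan2(Gt, np.sqrt(Gx**2 + Gy**2))    # temporal orientation
    return mag, phi, theta

patch = np.random.default_rng(3).normal(size=(8, 16, 16))
mag, phi, theta = gradient3d_descriptor(patch)
# Position-dependent histograms (as in SIFT) would then quantize the
# orientations, weighted by magnitude, over a grid of cells.
hist, _ = np.histogram(phi, bins=8, range=(-np.pi, np.pi), weights=mag)
print(hist.shape)  # -> (8,)
```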
Descriptors combination
• Descriptors perform differently on each action class.
• To exploit their complementarity, two combinations are proposed:
  1. Descriptor concatenation.
  2. Bag-of-Words concatenation.
• Codebook combination obtains the best performance.

Descriptor      Accuracy
3D Grad         93%
HoF             95%
3DGrad + HoF    96%

Descriptor      Accuracy
3D Grad         90%
HoF             84%
3DGrad + HoF    91%
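The difference between the two strategies can be sketched as follows; the descriptor dimensions and codebook sizes are illustrative, and the codebook assignment is stubbed with random words:

```python
# Strategy 1: concatenate raw descriptors, then one joint codebook.
# Strategy 2: one bag-of-words per descriptor type, then concatenate
# the histograms. Dimensions below are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n_points = 30
grad3d = rng.normal(size=(n_points, 96))   # 3D-gradient descriptors
hof = rng.normal(size=(n_points, 72))      # histogram-of-flow descriptors

# Strategy 1: descriptor concatenation -> single joint dictionary.
joint = np.hstack([grad3d, hof])           # (30, 168)

# Strategy 2: BoW concatenation -> one codebook per descriptor type.
def fake_bow(desc, k):
    words = rng.integers(0, k, size=len(desc))  # stand-in for assignment
    h = np.bincount(words, minlength=k).astype(float)
    return h / h.sum()

combined_bow = np.concatenate([fake_bow(grad3d, 50), fake_bow(hof, 50)])
print(joint.shape, combined_bow.shape)  # -> (30, 168) (100,)
```

Strategy 2 lets each descriptor type keep its own dictionary, which is the codebook combination that performs best above.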
Descriptor combination strategy
[Diagram: each ST patch is described by 3DGrad and HoF. Strategy 1 concatenates the descriptors into 3DGrad_HoF, quantized with a single visual dictionary into one BoW action representation. Strategy 2 quantizes each descriptor with its own visual dictionary and concatenates the resulting BoW action representations (3DGrad + HoF BoW).]
Effective codebooks:
• Spatio-temporal descriptors span an extremely high-dimensional feature space.
• Our dense multi-scale sampling produces a non-uniform feature space.
• K-means clusters are attracted towards densely populated regions; less dense zones are not represented correctly.
• Radius-based clustering [Jurie ICCV05] exploits mode finding to place cluster centers, giving a more accurate coding of the feature space.
Note: to reduce the uncertainty we perform soft assignment.
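Soft assignment could be sketched as follows, weighting all codebook centers with a Gaussian kernel on the descriptor-to-center distance instead of a hard nearest-neighbor vote (the kernel and its bandwidth are illustrative assumptions):

```python
# Soft assignment sketch: each descriptor distributes a unit vote
# over all cluster centers according to a Gaussian of the distance.
import numpy as np

def soft_assign(descriptors, centers, sigma=1.0):
    # Pairwise squared distances between descriptors and centers.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)   # each descriptor sums to 1
    return w.sum(axis=0)                # accumulate into a histogram

rng = np.random.default_rng(5)
desc = rng.normal(size=(40, 16))
centers = rng.normal(size=(10, 16))
hist = soft_assign(desc, centers)
print(hist.shape)  # -> (10,); hist sums to the number of descriptors
```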
Results: codebook performance
[Plots for KTH and Weizmann: accuracy vs. codebook size. Words are sorted by frequency and added incrementally to the dictionary; mid-frequency terms are informative, high-frequency terms are non-informative.]
Results: dataset
The approach is tested on two standard datasets.
• KTH
  • 25 actors
  • 6 actions
  • 4 viewing conditions
  • 2931 clips
• Weizmann
  • 9 actors
  • 10 actions
  • 1 viewing condition
  • 93 clips
The Weizmann dataset is considered less challenging due to the reduced variability of shooting conditions and the smaller number of actors.
Results: comparison with the state of the art
We compare our results, measured with the same methodology, against the current state of the art.

Method                          KTH     Weizmann
Our method                      92.57   95.41
Laptev et al. - HoG ['08]       81.6    -
Laptev et al. - HoF ['08]       89.7    -
Dollár et al. ['05]             81.2    -
Wong and Cipolla ['07]          86.6    -
Scovanner et al. ['07]          -       82.6
Niebles et al. ['08]            83.3    90
Liu et al. ['08]                -       90.4
Kläser et al. ['08]             91.4    84.3
Willems et al. ['08]            84.2    -
Real video footage
We test our detector on a sequence taken in a garage (walking and running actions).
A sliding temporal window is used to perform the segmentation.
Recognizing generic video events
• Online video search and video indexing.
• Events are characterized by an evolution of scenes, objects and actions over time.
• 56 events are defined in LSCOM.
• Event examples in the news domain: Airplane Flying, Car Exiting.
Event Recognition: Object Tracking
• A possible approach, which exploits object recognition, is to detect objects of interest, track them over time, and model their spatio-temporal dynamics (object detection & localization → tracking → inference, e.g. “Airplane Landing”?).
• Some events are well defined by the presence and motion of an object.
• It is hard to detect events without explicit object motion, such as Riot.
Event recognition: exploit dynamic concept evolution
[Pipeline: feature extraction → concept detectors → EMD distance]
• Global low-level features are extracted, such as edge histograms, Gabor texture descriptors and grid color moments.
• 108 concept detectors are trained on these features.
• Each frame is represented by 108 concept scores.
• Shot similarity is evaluated by computing the Earth Mover’s Distance (EMD).
• The EMD is plugged into an RBF kernel and used in an SVM to predict the category.
Content Representation: Mid-level Semantic Concept Scores
• Train detectors on low-level features over an image database.
• A mid-level semantic concept feature is more robust.
• Columbia developed and released 374 semantic concept detectors, available online at http://www.ee.columbia.edu/ln/dvmm/columbia374/
Earth Mover’s Distance (EMD): Approach
• Supplier P has a given amount of goods; receiver Q has a given limited capacity; d_ij is the ground distance between element i of P and element j of Q.
• The optimal flow (weights) is solved by linear programming.
• The EMD handles:
  • temporal shift: a frame at the beginning of P can be mapped to a frame at the end of Q;
  • scale variations: a frame from P can be mapped to multiple frames in Q.
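The transportation problem above can be sketched directly with a linear-programming solver; the toy signatures below are illustrative:

```python
# EMD sketch as a transportation LP: minimize sum f_ij * d_ij with
# row sums <= p_i, column sums <= q_j, total flow = min(sum p, sum q).
import numpy as np
from scipy.optimize import linprog

def emd(p, q, D):
    m, n = len(p), len(q)
    c = D.ravel()
    A_ub, b_ub = [], []
    # Row constraints: supplier i ships at most p_i.
    for i in range(m):
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
        A_ub.append(row); b_ub.append(p[i])
    # Column constraints: receiver j takes at most q_j.
    for j in range(n):
        col = np.zeros(m * n); col[j::n] = 1
        A_ub.append(col); b_ub.append(q[j])
    # Total flow must equal min(sum p, sum q).
    A_eq = [np.ones(m * n)]; b_eq = [min(p.sum(), q.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
    return res.fun / b_eq[0]  # cost normalized by total flow

p = np.array([0.4, 0.6]); q = np.array([0.5, 0.5])
D = np.array([[0.0, 1.0], [1.0, 0.0]])
print(round(emd(p, q, D), 3))  # -> 0.1
```

Because rows and columns only need to respect capacities (not match one-to-one), the distance naturally tolerates the temporal shifts and scale variations listed above.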
Experiments: keyframe-based feature performance
[Bar chart: average precision (0.0–1.0) per event class — Car Crash, Protest, Greeting, Car Exiting, Combat, Marching, Riot, Running, Shooting, Walking, and the average — for four features: concept scores, Gabor texture, edge direction histogram, color moments.]
Dataset: TRECVID2005. Evaluation metric: Average Precision.
References
• On space-time interest points. Laptev, I. IJCV 2005.
• Behavior recognition via sparse spatio-temporal features. Dollár, P., Rabaud, V., Cottrell, G. and Belongie, S. ICCV VS-PETS 2005.
• Effective Codebooks for Human Action Recognition. Ballan, L., Bertini, M., Del Bimbo, A., Seidenari, L. and Serra, G. ICCV VOEC 2009.
• Video Event Recognition using kernel methods with multilevel temporal alignment. Dong Xu, Shih-Fu Chang. TPAMI 2008.