Activity Recognition
Interactive Systems Laboratories, Universität Karlsruhe (TH)
Computer Vision for Human-Computer Interaction Research Group (cv:hci)
2008-01-29

Schedule
• Fri, 06.02.2009: Project 3: Student Presentations
• Mon, 09.02.2009: Audio-Visual Speech Recognition
• Fri, 13.02.2009: Review

Overview
• Introduction
  • Motivation
  • Typical Approaches
• Examples
  • Recognition of Human Movements using Temporal Templates
  • Layered HMMs for Activity Recognition
  • Activities in offices (2 examples)
  • Automatic Segmentation of Activity Zones in a lab

Introduction
• Why activity recognition?
  • Gain a higher-level understanding of the scene
  • Not just: person locations, movement, orientation
  • Rather:
    • What are these persons doing (walking, sitting, working, hiding)?
    • How are they doing it?
    • What is going on in the scene (meeting, party, telephone conversation, etc.)?
• Useful for video indexing/analysis, smart rooms, surveillance, etc.

Types of activities
• Single person:
  • Activities: jump, kneel, pick, put, run, sit, stand, walk
  • But also: step left, step right, backhand, swing, slide
  • Usually video analysis on close-up views
• Small groups (meetings):
  • Individual activities
    • Speaking, writing, listening, walking, standing up, sitting down, “fidgeting”, …
  • Group activities
    • Meeting start, end, discussion, presentation, monologue, dialogue, whiteboard, note-taking
  • Often audio-visual cues
• Rooms (office, kitchen):
  • Events:
    • Entering/leaving the room
    • Working at the desk
    • Making a phone call
    • Making coffee
  • Activities composed of events (in one or more offices):
    • Phone conference
    • Meeting
    • Short interruption / discussion
    • Fetching printouts from the printer in the lab
  • Here, too, audio-visual cues are used, but they are coarser in nature.
• Outdoor activities:
  • Mostly surveillance, e.g. in parking lots, in front of stores, in train stations:
    • Car enter, car leave, person enter, pickup, drop object (bomb?), hide, follow person, etc.
  • Recently became a very popular field because of the “fight against terror”.

Approaches
• Classification problem, typically on observations over time
  • Similar to gesture recognition
• Typical classifiers
  • HMMs and variants, e.g. Coupled HMMs, Layered HMMs
  • Dynamic Bayesian Networks (DBN)
  • But also: clustering, template matching, SVMs, neural nets

Example approaches
Single person activities

Recognition of Human Movement using Temporal Templates [Bobick01]
• Objective: Classify a set of activities based on a person’s motion
• Input:
  • Several close-up camera views
  • Static, indoor scene

Motivation
• Even with almost no structure in the video, humans can recognize an activity from its motion (walking, sitting down)
• No 2D/3D reconstruction of a body model is necessary
• Need to know:
  • Where is motion?
  • How is it moving?

Motion Features
• Motion Energy Image (MEI)
  • Captures the information: where is motion?

MEI
• Let I(x,y,t) be an image sequence
• Let D(x,y,t) be a binary image sequence indicating regions of motion (e.g. a difference image)
• The MEI is defined as:
  $E_\tau(x,y,t) = \bigcup_{i=0}^{\tau-1} D(x,y,t-i)$
• τ is the size of the observation window (1-2 secs)
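
The MEI definition above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `motion_energy_image`, the array layout (frames, height, width) and the toy sequence are not from the slides, and the window is simply clipped at the start of the sequence.

```python
import numpy as np

def motion_energy_image(D, t, tau):
    """Union (logical OR) of the last tau binary motion masks ending at frame t.

    D is a boolean array of shape (frames, height, width)."""
    start = max(0, t - tau + 1)          # clip the window at the sequence start
    return np.any(D[start:t + 1], axis=0)

# Toy sequence: a single "moving" pixel that shifts one column per frame.
D = np.zeros((4, 1, 5), dtype=bool)
for frame in range(4):
    D[frame, 0, frame] = True

mei = motion_energy_image(D, t=3, tau=3)   # union of frames 1, 2 and 3
```

With τ = 3, the mask covers the three most recent pixel positions and has already forgotten the position from frame 0.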

Motion Features
• Motion History Image (MHI)
  • Captures the information: how is the motion performed?

MHI
  $H_\tau(x,y,t) = \begin{cases} \tau & \text{if } D(x,y,t) = 1 \\ \max(0,\ H_\tau(x,y,t-1) - 1) & \text{otherwise} \end{cases}$
• Result: More recently moving pixels are brighter
• Note: The MEI can be generated by thresholding the MHI above zero
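
The MHI recursion can be sketched as a per-frame update. The helper name `update_mhi` and the toy sequence are illustrative assumptions; the last line shows the note that thresholding the MHI above zero yields the MEI.

```python
import numpy as np

def update_mhi(H_prev, D_t, tau):
    """One MHI step: moving pixels are set to tau, all others decay by 1 (floored at 0)."""
    decayed = np.maximum(0, H_prev - 1)
    return np.where(D_t, tau, decayed)

tau = 3
D = np.zeros((4, 1, 5), dtype=bool)
for frame in range(4):
    D[frame, 0, frame] = True        # one pixel moving right, one column per frame

H = np.zeros((1, 5), dtype=int)
for frame in range(4):
    H = update_mhi(H, D[frame], tau)

mei_from_mhi = H > 0                 # thresholding the MHI above zero
```

After four frames the history image reads 0, 1, 2, 3, 0 along the row, so the most recent position is the brightest and the oldest one has faded out.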

Why both MEI and MHI?
• For some moves the MEIs are similar, for others the MHIs are similar
• MEI and MHI capture different characteristics of motion
  • “where” and “how”

Multi-view MEI/MHI
• Some motions have similar temporal templates
  → Use templates from several viewing angles
Figure: MEI of sitting movement over 90 degree viewing angle

Matching Temporal Templates
• Training
  • Collect training examples for each move from a variety of viewing angles
  • Generate MEIs and MHIs
  • Calculate scale- and orientation-invariant features (Hu moments) on the images → feature vector
  • Build a statistical model of the moments (mean, covariance matrix)
• Recognition:
  • Calculate the Mahalanobis distance between the moment description of the input and each of the stored movements.
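
The training and recognition steps above can be sketched as follows. This is a minimal illustration, not the implementation from [Bobick01]: the function names, the two move labels and the 2-D toy feature vectors (standing in for Hu-moment vectors) are all hypothetical.

```python
import numpy as np

def fit_model(samples):
    """Mean and inverse covariance of training feature vectors (rows = examples)."""
    mean = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    return mean, np.linalg.inv(cov)

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of x from a model given by mean and inverse covariance."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def classify(x, models):
    """Return the label of the stored movement with the smallest distance."""
    return min(models, key=lambda label: mahalanobis(x, *models[label]))

# Hypothetical 2-D feature vectors standing in for Hu moments of two moves.
train = {
    "sit":  np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
    "wave": np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]]),
}
models = {label: fit_model(x) for label, x in train.items()}
label = classify(np.array([0.4, 0.7]), models)
```

With real data the covariance matrix can be close to singular when there are few training examples; some form of regularization would then be needed before inverting it.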

Moments as Shape Descriptors
• Idea: a density distribution (e.g. an image) is well described by its moments
  → use statistical properties (moments) to describe its shape
• Two-dimensional (p+q)th order moments of a density distribution function ρ(x,y):
  $m_{pq} = \int\!\!\int x^p y^q \rho(x,y)\, dx\, dy$
• Central moments (invariant to translation):
  $\mu_{pq} = \int\!\!\int (x-\bar{x})^p (y-\bar{y})^q \rho(x,y)\, dx\, dy$
  where $\bar{x} = m_{10}/m_{00}$, $\bar{y} = m_{01}/m_{00}$ (centroids)
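
For a digital image the integrals become sums over pixels. The sketch below uses that discrete form, with the image values playing the role of the density; the function names and the toy blob are illustrative assumptions. It also shows that the central moments do not change when the blob is translated.

```python
import numpy as np

def raw_moment(img, p, q):
    """Discrete raw moment: sum over pixels of x^p * y^q * img(x, y)."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    return float(np.sum((x ** p) * (y ** q) * img))

def central_moment(img, p, q):
    """Discrete central moment, taken about the centroid (x_bar, y_bar)."""
    m00 = raw_moment(img, 0, 0)
    x_bar = raw_moment(img, 1, 0) / m00
    y_bar = raw_moment(img, 0, 1) / m00
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    return float(np.sum(((x - x_bar) ** p) * ((y - y_bar) ** q) * img))

img = np.zeros((8, 8))
img[1:3, 1:4] = 1.0                                   # a 2x3 blob
shifted = np.roll(img, shift=(3, 2), axis=(0, 1))     # same blob, translated

same_mu20 = np.isclose(central_moment(img, 2, 0), central_moment(shifted, 2, 0))
```

The raw moment m_10 differs between the two images, while the central moment μ_20 is identical.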

Hu-moments [Hu 1962]
• Goal: Find translation-, scale- and rotation-invariant moments for pattern recognition
• Central moments (first four orders):
• Normalize for scale invariance:
  $\eta_{pq} = \mu_{pq} / \mu_{00}^{\gamma}$, where $\gamma = (p+q)/2 + 1$
• The first seven orientation-invariant Hu moments
• Hu moments are translation-, scale- and rotation-invariant.
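
The formulas of the seven Hu moments are not reproduced in the text above. The sketch below uses the commonly cited form of the seven invariants, built from scale-normalized central moments with exponent (p+q)/2 + 1; the function names and the L-shaped test image are illustrative assumptions. It checks the invariance on two cases that are exact on a pixel grid: a 90 degree rotation and a one-pixel shift.

```python
import numpy as np

def normalized_central_moments(img):
    """eta_pq = mu_pq / mu_00^((p+q)/2 + 1) for 2 <= p+q <= 3."""
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    m00 = img.sum()
    xc, yc = (x * img).sum() / m00, (y * img).sum() / m00
    eta = {}
    for p in range(4):
        for q in range(4):
            if 2 <= p + q <= 3:
                mu = (((x - xc) ** p) * ((y - yc) ** q) * img).sum()
                eta[p, q] = mu / m00 ** ((p + q) / 2 + 1)
    return eta

def hu_moments(img):
    """The seven Hu invariants of a 2-D intensity image."""
    n = normalized_central_moments(img)
    a, b = n[3, 0] + n[1, 2], n[2, 1] + n[0, 3]
    c, d = n[3, 0] - 3 * n[1, 2], 3 * n[2, 1] - n[0, 3]
    return np.array([
        n[2, 0] + n[0, 2],
        (n[2, 0] - n[0, 2]) ** 2 + 4 * n[1, 1] ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n[2, 0] - n[0, 2]) * (a ** 2 - b ** 2) + 4 * n[1, 1] * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ])

img = np.zeros((10, 10))
img[2:7, 2:4] = 1.0
img[5:7, 4:8] = 1.0                      # an asymmetric L-shaped blob
rotated = np.rot90(img)                  # exact 90 degree rotation
shifted = np.roll(img, 1, axis=1)        # translation by one pixel
```

Scale invariance holds only approximately on sampled images, which is why it is not checked here.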

Recognized Moves
• 18 aerobic exercises
  • Several executions
  • Seven views (-90 to +90 deg, 30 deg increments)
• Results
  • With 1 camera: 12 out of 18 correct
  • With 2 cameras: 15 out of 18 correct

Temporal Templates in 3D
• Temporal templates can also be applied to voxels
• Analysis of the moments with a PCA classifier leads to robust results
• Classification of 8 actions:
  • raising hand,
  • sitting down,
  • waving hands,
  • crouching down,
  • standing up,
  • punching,
  • kicking or
  • jumping.
Figure: Motion Energy Volume, Motion History Volume [System by UPC, 2006]

Small Group Activities

Layered Representations for Human Activity Recognition [Oliver02]
• Target:
  • Recognize complex human activities over a longer period of time (“context” in an office setting).
• Types of context (situations, activities, etc.):
  • Phone conversation
  • Face-to-face conversation
  • Presentation
  • Distant conversation
  • Others…
• Sensors:
  • Binaural microphones
  • USB camera
  • Keyboard and mouse

Hierarchical approach
• Problem with normal HMMs:
  • Lack of structure
  • Large parameter space
  • Overfitting on long sequences with little training data
    → bad generalization
  • Fusion of various streams is possible, but multiplies the required parameters → need even more training data
• Solution:
  • Hierarchical (Layered) Hidden Markov Models (LHMMs)

Layered HMMs
Figure: layered HMM architecture. The output of the lower layer is the input to the higher layer.
• Higher layer, activities: phone conversation, face-to-face conversation, presentation, distant conversation
• Lower layer, classes: nobody present, one person, one active person, multiple people; music, silence, phone ring

Temporal Granularity
• HMMs at level L use sliding windows of T_L samples
  • Data in the time window at level L is analyzed
  • Likelihoods are computed
  • The result is passed on to level L+1 as input
• Window length varies with the levels
  • The higher the level, the larger the time scale T_L
  • Higher levels model longer activities
  • The abstraction level increases at higher levels
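
How the window lengths compound across levels can be sketched with plain Python lists. Everything here is an assumption made for illustration: the window lengths, the non-overlapping step at level 1, and the mean that stands in for evaluating the level-1 HMMs on a window.

```python
def sliding_windows(seq, length, step=1):
    """Yield consecutive windows of `length` items, advancing by `step`."""
    for start in range(0, len(seq) - length + 1, step):
        yield seq[start:start + length]

samples = list(range(12))      # raw observations arriving at level 1
T1, T2 = 4, 3                  # hypothetical window lengths for levels 1 and 2

# Stand-in for "evaluate the level-1 models on a window": here just the mean.
level1_out = [sum(w) / len(w) for w in sliding_windows(samples, T1, step=T1)]

# Level 2 sees windows of T2 level-1 results, i.e. T1 * T2 raw samples each.
level2_in = list(sliding_windows(level1_out, T2))
```

In this toy setup a single level-2 window already spans all twelve raw samples, which is the sense in which higher levels work on larger time scales.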

Layered HMMs
• Lower-level HMMs are trained independently (separately for each stream), using the Baum-Welch algorithm.
• Low-level HMMs recognize fine-grained context
  • Nobody present, one person, one active person, multiple people. Also: music, silence, phone ring
• Output of lower levels is passed to higher levels
• 2 approaches:
  • Maxbelief: Only information from the most likely HMM is passed
  • Distributional: Full probability distribution over models is passed
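
Both approaches need a likelihood per HMM for the current window. A common way to obtain it is the forward algorithm; the sketch below is a scaled version for a discrete HMM. The model parameters are made-up toy values, and the slides do not state which evaluation procedure the original system used.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | HMM) for a discrete HMM.

    pi: initial state probabilities (N,), A: transition matrix (N, N),
    B: emission probabilities (N, M), obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for symbol in obs[1:]:
        scale = alpha.sum()
        log_p += np.log(scale)               # accumulate the scaling factors
        alpha = (alpha / scale) @ A * B[:, symbol]
    return log_p + np.log(alpha.sum())

# Toy two-state model with two output symbols (all values are assumptions).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

p = np.exp(log_likelihood([0, 1], pi, A, B))   # P(observing 0 then 1)
```

Evaluating every trained HMM of a level on the same window gives the K values that are then passed upward.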

Layered HMMs
• Maxbelief: Winner(t)
  $A_{t-T}, \ldots, A_{t-2}, A_{t-1}, A_t$
  T discrete symbols in {1,…,K}
• Distributional:
  $\begin{pmatrix} L_1^{t-T} \\ L_2^{t-T} \\ \vdots \\ L_K^{t-T} \end{pmatrix}, \ldots, \begin{pmatrix} L_1^{t-1} \\ L_2^{t-1} \\ \vdots \\ L_K^{t-1} \end{pmatrix}, \begin{pmatrix} L_1^{t} \\ L_2^{t} \\ \vdots \\ L_K^{t} \end{pmatrix}$
  Vector of K likelihoods for each time step
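
The difference between the two inputs can be made concrete with a small likelihood table. The numbers are invented, indices are 0-based (the slide counts from 1 to K), and the row normalization in the distributional case is an assumption about how the likelihoods are turned into a distribution.

```python
import numpy as np

# Hypothetical log-likelihoods of K = 3 lower-level HMMs at 4 time steps
# (rows = time steps, columns = models).
log_lik = np.log(np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]))

# Maxbelief: one discrete symbol per time step (index of the winning model).
maxbelief = np.argmax(log_lik, axis=1)

# Distributional: the full vector of K values per time step,
# normalized here so that each row sums to one.
lik = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
distributional = lik / lik.sum(axis=1, keepdims=True)
```

Maxbelief hands the next level a short symbol sequence, while the distributional variant keeps the information about how close the runner-up models were.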

Advantages
• Smaller state space (fewer parameters) than comparable conventional HMMs
  • Less prone to overfitting than HMMs
  • Need little training data at each level
• Lower-level HMMs can be retrained separately
  → adapt to new office settings
• More intuitive, structured representation
  • Encodes the temporal structure of the activity modeling problem
• Difficulty: The time granularity of each step is defined manually (1 sec, 5 sec, …)

Visual features
• Skin color density (over whole image)
  (classification using skin/non-skin color histograms in HSV space)
• Motion density
  (image differences)
• Foreground pixel density
  (background subtraction using a learned background)
• Face pixel density
  (using a real-time face detector)

Comparison HMM, L-HMM
Figure: per-frame normalized likelihoods of the models during real-time testing of different office activities. Panels: Single HMM; Hierarchical HMM (likelihoods are those of the highest-level models).

Room/Office Activities

Activity Recognition and Room-level Tracking in an Office Environment [Wojek et al. 2006]
• Activity recognition makes it possible to infer:
  • User’s situation and availability
  • Interactions within groups
  • Can be used to produce a diary of each day
• Project goals
  • Detection of local events (e.g. somebody is entering a room, phone call, ...)
  • Fusion of those to detect global situations (e.g. meeting)
  • (Track people’s locations across offices)
  • Use a lightweight feature set and simple equipment that works under varying conditions

Floor Layout / Sensor Setup
Figure: floor plan (labels include Office B, Office D)
• Seven people in four offices (plus smart room)
• Sensors:
  • One camera per office
  • One omnidirectional microphone per office

Features
• Video features
  • Foreground
  • Optical flow
• Audio
  • Signal energy
  • Zero crossing rate
  • Pitch
• Uses a data-driven local feature model
  • Foreground is modeled as a GMM
  • Video features are calculated for each Gaussian
  • Data-driven way to find meaningful areas
  • Reduces dimensionality!
Figure panels: Foreground, Learned FG model, Optical Flow
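
Two of the audio features listed above are simple enough to sketch directly. The definitions below (mean squared amplitude, fraction of sign changes between adjacent samples) are common textbook forms and are assumed here; the slides do not give the exact formulas used in the system, and pitch estimation is left out.

```python
import numpy as np

def signal_energy(frame):
    """Mean squared amplitude of one audio frame."""
    return float(np.mean(np.square(frame)))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

fast = np.array([1.0, -1.0, 1.0, -1.0, 1.0])     # sign flips at every sample
slow = np.array([0.5, 0.5, -0.5, -0.5])          # a single sign change

features = [(signal_energy(f), zero_crossing_rate(f)) for f in (fast, slow)]
```

A loud, rapidly oscillating frame scores high on both features, while the quieter, slower frame scores low on both.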

Foreground Detection
• Alpha-weighted difference images to detect foreground regions
  • Simple background model
  • Pixels with a distance > m to the background are classified as foreground
  • Adaptation speed set via alpha
  • Fast and robust
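
The formulas for the background model are not recoverable from the text above. One widely used form that matches the description is an exponentially weighted running average with a fixed distance threshold; the sketch below assumes that form, and the function names and toy values are illustrative only.

```python
import numpy as np

def update_background(background, frame, alpha):
    """Blend the new frame into the background with weight alpha."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(background, frame, m):
    """Pixels whose absolute difference to the background exceeds m."""
    return np.abs(frame - background) > m

background = np.full((2, 3), 10.0)
frame = np.array([[10.0, 10.0, 60.0],
                  [10.0, 12.0, 10.0]])

mask = foreground_mask(background, frame, m=20.0)
background = update_background(background, frame, alpha=0.1)
```

A larger alpha lets the background follow illumination changes faster, at the price of absorbing people who stay still for a while.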

Example Foreground Segmentation

Activity Recognition – Local feature model
Figure: Resulting Gaussians for significant image areas and their first three standard deviations

Activity Recognition with Layered HMMs
• Idea
  • First layer consists of two groups of HMMs (audio HMMs and video HMMs) to detect events
  • Higher-level HMMs are fed with the output probabilities of the lower-level HMMs in order to detect situations
• Feature vector structure on the lowest level:
  • Video features:
  • Audio features:

Multi-Layer HMMs
• Multi-layered HMM approach
  • First layer to detect events
  • Higher-level HMMs to detect situations
• Examples of events (V: video, A: audio)
  • Somebody is sitting at a certain place (V)
  • Somebody is entering a room (V)
  • Somebody is leaving a room (V)
  • Somebody is talking vs. ambient noise (A)
• Examples of situations
  • Meeting with a visitor
  • Desk work
  • Discussion in an office
  • Nobody in office
Figure: high-level HMMs on top of low-level HMMs (A+V)

Results (Office B, two persons)
• Training data: 4 full days
• Test data: 2 full days
• Both included:
  • Daylight
  • Artificial light (evening)
  • Cloudy skies
  • Sunlight
  • …

Results (Office B, two persons)
• First Level:
• Second Level:

Video

Summary [Wojek et al.]
• The system allows
  • detecting events and situations in several offices
  • tracking colleagues on the floor (not explained here, see paper)
• Real-world data used
  • Recorded on seven days during working hours (tested on two days)
  • Data includes all kinds of illumination (sunlight, cloudy sky, artificial illumination at night, etc.)
• Useful
  • to provide a semantic description of what is going on (and where)
    • for example as a diary
  • to determine the availability of people

Activity Maps for Location-Aware Computing [Demirdjian02]
• Target:
  • Recognize basic activities in an office environment.
• Method:
  • Estimation of activity based on “activity zones” and primitive features (position, height).
• Sensors, features:
  • Stereo cameras
  • Disparity, intensity images
  • Person trajectories + height in a plan view of the room

Automatic estimation of activity maps
• Activity zones are not defined manually
• Rather: automatic segmentation based on observed features
• Features:
  • From the tracker we get a 3D information history (x,y,h)
    → calculate feature f(x,y) = (h, v, v_lt)
  • h: person height from the range image
  • $v = \sqrt{v_x^2 + v_y^2}$: ground plane velocity
  • v_lt: average ground plane velocity over a certain time frame
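
The feature vector can be derived from a tracked trajectory as sketched below. The finite-difference velocity estimate, the fixed time step `dt` and the running-mean window for the long-term velocity are assumptions made for illustration; [Demirdjian02] may compute them differently.

```python
import numpy as np

def activity_features(track, dt, window):
    """track: array of rows (x, y, h), one per time step, dt seconds apart.

    Returns rows (h, v, v_lt) for every step after the first."""
    vx = np.diff(track[:, 0]) / dt
    vy = np.diff(track[:, 1]) / dt
    v = np.sqrt(vx ** 2 + vy ** 2)               # ground plane speed
    # Running mean of v over (up to) the `window` most recent steps.
    v_lt = np.array([v[max(0, i - window + 1):i + 1].mean()
                     for i in range(len(v))])
    return np.column_stack([track[1:, 2], v, v_lt])

# A person walks 5 units in one step, then stops and sits down (height drops).
track = np.array([[0.0, 0.0, 1.8],
                  [3.0, 4.0, 1.8],
                  [3.0, 4.0, 1.2],
                  [3.0, 4.0, 1.2]])
f = activity_features(track, dt=1.0, window=2)
```

The long-term velocity lags behind the instantaneous one, so it still reports movement for one step after the person has stopped.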

Segmentation into Activity Zones
• Two steps:
  • 1) Cluster the features, independent of spatial information, into classes.
  • 2) Group features from the same class that are close to each other into zones (eliminate zones that are too small)
• In the resulting map, one area may correspond to several overlapping activity zones

K-Means Clustering
• 1) Initialize K cluster centers randomly.
• 2) Assign each data point to the closest cluster center.
• 3) Recompute the cluster centers based on their respective data points.
• 4) Repeat steps 2 and 3 until terminated (e.g. when the assignments no longer change).
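
The four steps map directly onto a short NumPy routine. This is a minimal sketch: initializing from randomly chosen data points, the fixed seed, the iteration cap and the stop-when-unchanged criterion are choices made here, not details given on the slide.

```python
import numpy as np

def k_means(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Initialize: pick k distinct data points as starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    labels = None
    for _ in range(iterations):
        # 2) Assign each point to its closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 4) Terminate once the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 3) Recompute each center as the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = k_means(points, k=2)
```

For the activity maps, the points would be the feature vectors (h, v, v_lt), clustered without regard to where in the room they were observed.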

2nd step: Region growing
• Select a feature from each class (seed).
• Find features from the same class that lie close to the seed and add them to the region.
• When the region cannot be grown anymore, select a new seed until all features from the class are used.
• Creates spatial groupings of features belonging to the same class (remove regions with few features).
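
The region-growing step can be sketched as a flood fill over feature positions. The closeness test (Euclidean distance within a `radius`), the seed order and the `min_size` cut-off are assumptions for illustration; the slides only say that nearby features of the same class are merged and small regions removed.

```python
import numpy as np

def grow_regions(positions, labels, radius, min_size=1):
    """Group same-class features into spatially connected regions.

    Two features are neighbours if they share a class and lie within
    `radius` of each other. Returns a list of (class, [feature indices])."""
    unused = set(range(len(positions)))
    regions = []
    while unused:
        seed = min(unused)                 # next unused feature becomes the seed
        unused.remove(seed)
        region, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [i for i in unused
                    if labels[i] == labels[seed]
                    and np.linalg.norm(positions[i] - positions[current]) <= radius]
            for i in near:
                unused.remove(i)
            region.extend(near)
            frontier.extend(near)
        if len(region) >= min_size:        # drop regions with too few features
            regions.append((labels[seed], sorted(region)))
    return regions

positions = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0], [1.0, 1.0]])
labels = [0, 0, 0, 0, 1]
zones = grow_regions(positions, labels, radius=1.5, min_size=2)
```

Feature 3 belongs to class 0 but lies far away, and feature 4 is close but of another class, so both end up in single-element regions that the size filter removes.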

Detecting a person’s activity zone
• Determine the person’s position and feature vector f = (h, v, v_lt) through the stereo tracker.
• Find the subset Z of zones lying close to the person’s position
• Choose the correct zone from the subset based on the feature vector f.
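
The lookup can be sketched as a two-stage filter. The zone representation (a center and a feature prototype per zone), the distance threshold and the nearest-prototype rule are hypothetical; the slides do not specify how the zone is chosen from the feature vector.

```python
import numpy as np

def detect_zone(position, f, zones, max_dist):
    """zones: list of dicts with hypothetical keys 'name', 'center', 'feature'."""
    # Subset Z: zones whose center lies close to the person's position.
    Z = [z for z in zones if np.linalg.norm(position - z["center"]) <= max_dist]
    if not Z:
        return None
    # Among those, pick the zone whose feature prototype is nearest to f.
    return min(Z, key=lambda z: np.linalg.norm(f - z["feature"]))["name"]

zones = [
    {"name": "walking", "center": np.array([2.0, 2.0]),
     "feature": np.array([1.7, 1.0, 1.0])},
    {"name": "desk", "center": np.array([2.0, 2.5]),
     "feature": np.array([1.2, 0.0, 0.1])},
    {"name": "cabinet", "center": np.array([8.0, 8.0]),
     "feature": np.array([1.7, 0.0, 0.2])},
]

# Same position, different features: seated and still vs. upright and moving.
seated = detect_zone(np.array([2.1, 2.2]), np.array([1.25, 0.05, 0.1]), zones, 1.0)
moving = detect_zone(np.array([2.1, 2.2]), np.array([1.7, 0.9, 0.9]), zones, 1.0)
```

This shows why overlapping zones are useful: the same spot in the room can mean "walking past" or "working at the desk" depending on height and velocity.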

Example: Two person office
Zone  Category
1     Walking
2     Working Desk - User A
3     Working Desk - User B
4     Working Desk - User B (faster)
5     File cabinet
Figure panels: feature classes after k-means; automatically determined activity zones
Category names are human interpretation!

Summary
• Temporal templates to classify aerobic movements
  • MEI, MHI, scale-, translation- and rotation-invariant features
• Layered HMMs
  • Deduce high-level activities from low-level events
  • Reduce the state space and the amount of training data needed; helpful for modeling temporal granularities
• Unsupervised clustering of activities
  • Cluster activity zones (k-means, region growing)

References
A. F. Bobick, J. W. Davis. The Recognition of Human Movement Using Temporal Templates. IEEE PAMI, Vol. 23, No. 3, March 2001.
N. Oliver, E. Horvitz, A. Garg. Layered Representations for Human Activity Recognition. Proceedings of the 4th IEEE International Conference on Multimodal Interfaces (ICMI).
C. Wojek, K. Nickel, R. Stiefelhagen. Activity Recognition and Room-Level Tracking in an Office Environment. IEEE Int. Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI06), September 2006.
D. Demirdjian, K. Tollmar, K. Koile, N. Checka, T. Darrell. Activity Maps for Location-Aware Computing. Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision, 2002.