Activity Recognition - cv:hci - CVHCI...Activity Recognition and Room-level Tracking in an Office...

Activity Recognition

Interactive Systems Laboratories, Universität Karlsruhe (TH)

2008-01-29

Computer Vision for Human-Computer Interaction Research Group (cv:hci), Universität Karlsruhe (TH)

Schedule

• Fri, 06.02.2009
  Project 3: Student Presentations
• Mon, 09.02.2009
  Audio-Visual Speech Recognition
• Fri, 13.02.2009
  Review session

Overview

• Introduction
  • Motivation
  • Typical Approaches
• Examples
  • Recognition of Human Movements using Temporal Templates
  • Layered HMMs for Activity Recognition
  • Activities in offices (2 examples)
  • Automatic Segmentation of Activity Zones in a lab

Introduction

• Why activity recognition?
  • Gain a higher-level understanding of the scene
  • Not just: person locations, movement, orientation
  • Rather:
    • What are these persons doing (walking, sitting, working, hiding)?
    • How are they doing it?
    • What is going on in the scene (meeting, party, telephone conversation, etc.)?
• Useful for video indexing/analysis, smart rooms, surveillance, etc.

Types of activities

• Single person:
  • Activities: jump, kneel, pick, put, run, sit, stand, walk
  • But also: step left, step right, backhand, swing, slide
  • Usually video analysis on close-up views

• Small groups (meetings):
  • Individual activities
    • Speaking, writing, listening, walking, standing up, sitting down, "fidgeting", ...
  • Group activities
    • Meeting start, end, discussion, presentation, monologue, dialogue, whiteboard, note-taking
  • Often audio-visual cues

• Rooms (office, kitchen):
  • Events:
    • Entering/leaving the room
    • Working at the desk
    • Making a phone call
    • Making coffee
  • Activities composed of events (in one or more offices):
    • Phone conference
    • Meeting
    • Short interruption / discussion
    • Fetching printouts from the printer in the lab
  • Here too, audio-visual cues are used, but they are coarser in nature

• Outdoor activities:
  • Mostly surveillance, e.g. in parking lots, in front of stores, in train stations:
    • Car enter, car leave, person enter, pickup, drop object (bomb?), hide, follow person, etc.
  • Recently became a very popular field because of the "fight against terror"

Approaches

• Classification problem, typically based on observation over time
• Similar to gesture recognition
• Typical classifiers
  • HMMs and variants, e.g. Coupled HMMs, Layered HMMs
  • Dynamic Bayesian Networks (DBN)
  • But also: clustering, template matching, SVMs, neural nets

Example approaches

Single person activities

Recognition of Human Movement using Temporal Templates [Bobick01]

• Objective: classify a set of activities based on a person's motion
• Input:
  • Several close-up camera views
  • Static, indoor scene

Motivation

• Even with almost no structure in the video, humans can recognize activity through motion (walking, sitting down)
• No 2D/3D reconstruction of a body model necessary
• Need to know:
  • Where is motion?
  • How is it moving?

Motion Features

• Motion Energy Image (MEI)
  • Captures the information: where is motion?

MEI

• Let I(x,y,t) be an image sequence
• Let D(x,y,t) be a binary image sequence indicating regions of motion (e.g. a difference image)
• The MEI is defined as:

  $$E_\tau(x,y,t) = \bigcup_{i=0}^{\tau-1} D(x,y,t-i)$$

• τ is the size of the observation window (1-2 s)
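The MEI definition is a pixel-wise union of the last τ binary motion masks. A minimal NumPy sketch, assuming the masks are stored as a boolean array `D` of shape (frames, height, width); the function name and the toy data are illustrative and not taken from the slides:

```python
import numpy as np

def motion_energy_image(D, t, tau):
    """Pixel-wise OR of the binary motion masks D[t - tau + 1], ..., D[t]."""
    start = max(0, t - tau + 1)            # clip the window at the sequence start
    return np.any(D[start:t + 1], axis=0)

# Toy sequence: one "moving" pixel that shifts one column per frame.
D = np.zeros((4, 1, 5), dtype=bool)
for frame in range(4):
    D[frame, 0, frame] = True

mei = motion_energy_image(D, t=3, tau=3)   # union of frames 1, 2 and 3
```

The result marks every pixel that moved at any time inside the window, without saying when it moved.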

Motion Features

• Motion History Image (MHI)
  • Captures the information: how is the motion done?

MHI

$$H_\tau(x,y,t) = \begin{cases} \tau & \text{if } D(x,y,t) = 1 \\ \max\bigl(0,\; H_\tau(x,y,t-1) - 1\bigr) & \text{otherwise} \end{cases}$$

• Result: more recently moving pixels are brighter
• Note: the MEI can be generated by thresholding the MHI above zero
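The MHI recursion can be run frame by frame. The sketch below assumes the same boolean mask layout (frames, height, width) and an initial history of zeros; both are assumptions for illustration, as are the function names:

```python
import numpy as np

def update_mhi(H_prev, D_t, tau):
    """One MHI step: moving pixels are set to tau, all others decay by 1 (floor 0)."""
    decayed = np.maximum(0, H_prev - 1)
    return np.where(D_t, tau, decayed)

def motion_history_image(D, tau):
    H = np.zeros(D.shape[1:], dtype=int)   # assumed start: no motion history
    for D_t in D:
        H = update_mhi(H, D_t, tau)
    return H

# Toy sequence: one "moving" pixel that shifts one column per frame.
D = np.zeros((4, 1, 5), dtype=bool)
for frame in range(4):
    D[frame, 0, frame] = True

mhi = motion_history_image(D, tau=3)
mei_from_mhi = mhi > 0                     # thresholding above zero yields the MEI
```

In the toy sequence the most recently visited column holds the largest value, which is the "brighter means more recent" property stated above.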

Why both MEI and MHI?

• For some moves, MEIs are similar, for others, MHIs are similar
• MEI and MHI capture different characteristics of motion
  • "where" and "how"

Multi-view MEI/MHI

• Some motions have similar temporal templates
  → Use templates from several viewing angles

[Figure: MEI of sitting movement over 90 degree viewing angle]

Matching Temporal Templates

• Training
  • Collect training examples for each move from a variety of viewing angles
  • Generate MEIs and MHIs
  • Calculate scale- and orientation-invariant features (Hu moments) on the images → feature vector
  • Build a statistical model of the moments (mean, covariance matrix)
• Recognition:
  • Calculate the Mahalanobis distance between the moment description of the input and each of the stored movements
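The training and recognition steps above amount to fitting one Gaussian-style model (mean and covariance) per move and picking the move with the smallest Mahalanobis distance. A minimal sketch with made-up 2-D feature vectors standing in for the Hu-moment vectors; the small ridge added to the covariance is an assumption to keep it invertible on tiny training sets, not something stated on the slide:

```python
import numpy as np

def fit_move_model(features):
    """Mean and inverse covariance of the training feature vectors of one move."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    cov = cov + 1e-6 * np.eye(X.shape[1])   # assumed regularization
    return mean, np.linalg.inv(cov)

def mahalanobis(x, model):
    mean, cov_inv = model
    d = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(d @ cov_inv @ d))

def classify(x, models):
    """Label of the stored move whose model is closest to x."""
    return min(models, key=lambda label: mahalanobis(x, models[label]))

# Hypothetical training data (stand-ins for moment vectors of MEIs/MHIs).
train = {
    "sit":  [[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2], [0.1, -0.2]],
    "wave": [[3.0, 3.0], [3.2, 2.9], [2.9, 3.1], [3.1, 3.2]],
}
models = {label: fit_move_model(X) for label, X in train.items()}
label = classify([0.1, 0.0], models)
```

Unlike a plain Euclidean distance, the Mahalanobis distance scales each direction by how much the training examples of that move vary along it.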

Moments as Shape Descriptors

• Idea: a density distribution (e.g. an image) is well described by its moments
  → Use statistical properties (moments) to describe its shape
• Two-dimensional (p+q)th order moments of a density distribution function: [equation not preserved in this transcript]
• Central moments (invariant to translation), defined relative to the centroids: [equations not preserved in this transcript]
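The equations of this slide did not survive extraction. For orientation, the standard textbook definitions of these quantities are given below; the notation (ρ for the density, m and μ for raw and central moments) is a common convention and may differ from what the slide showed:

```latex
% Raw moment of order (p+q) of a density \rho(x,y)
m_{pq} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^{p}\, y^{q}\, \rho(x,y)\, dx\, dy

% Central moment (translation-invariant)
\mu_{pq} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x-\bar{x})^{p}\, (y-\bar{y})^{q}\, \rho(x,y)\, dx\, dy

% Centroids
\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}}
```

For a digital image the integrals become sums over pixel coordinates, with the pixel value playing the role of ρ(x,y).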

Hu-moments [Hu 1962]

• Goal: find translation-, scale- and rotation-invariant moments for pattern recognition
• Central moments (first four orders): [equations not preserved in this transcript]
• Normalize for scale invariance: [equation not preserved in this transcript]

Hu-moments

• The first seven orientation-invariant Hu moments: [equations not preserved in this transcript]
• Hu moments are translation-, scale- and rotation-invariant.
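As an illustration of the invariance claims, the sketch below computes raw, central and normalized moments on a pixel grid and the first two Hu invariants, using the standard formulas η_pq = μ_pq / μ_00^((p+q)/2 + 1), φ1 = η20 + η02 and φ2 = (η20 − η02)² + 4·η11². The function names and the test image are illustrative; the remaining five invariants are omitted for brevity:

```python
import numpy as np

def raw_moment(img, p, q):
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    return float(np.sum((x ** p) * (y ** q) * img))

def central_moment(img, p, q):
    m00 = raw_moment(img, 0, 0)
    xc = raw_moment(img, 1, 0) / m00        # centroid x
    yc = raw_moment(img, 0, 1) / m00        # centroid y
    y, x = np.mgrid[:img.shape[0], :img.shape[1]]
    return float(np.sum(((x - xc) ** p) * ((y - yc) ** q) * img))

def normalized_moment(img, p, q):
    gamma = (p + q) / 2 + 1
    return central_moment(img, p, q) / central_moment(img, 0, 0) ** gamma

def hu_first_two(img):
    n20 = normalized_moment(img, 2, 0)
    n02 = normalized_moment(img, 0, 2)
    n11 = normalized_moment(img, 1, 1)
    return n20 + n02, (n20 - n02) ** 2 + 4 * n11 ** 2

img = np.zeros((12, 12))
img[2:5, 3:9] = 1.0                               # a 3x6 rectangle
shifted = np.roll(img, (4, 1), axis=(0, 1))       # translated copy (no wrap-around)
rotated = np.rot90(img)                           # copy rotated by 90 degrees

hu_original = hu_first_two(img)
hu_shifted = hu_first_two(shifted)
hu_rotated = hu_first_two(rotated)
```

On a discrete pixel grid, translation by whole pixels and rotation by 90 degrees leave the values unchanged up to rounding, while scale invariance holds only approximately.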

Recognized Moves

• 18 aerobic exercises
• Several executions
• Seven views (-90° to +90°, in 30° increments)
• Results
  • With 1 camera: 12 out of 18 correct
  • With 2 cameras: 15 out of 18 correct

Temporal Templates in 3D

• Temporal templates can also be applied to voxels
• Analysis of moments with a PCA classifier leads to robust results
• Classification of 8 actions:
  • raising hand,
  • sitting down,
  • waving hands,
  • crouching down,
  • standing up,
  • punching,
  • kicking or
  • jumping.

[Figure panels: Motion Energy Volume, Motion History Volume]

[System by UPC, 2006]

Small Group Activities

Layered Representations for Human Activity Recognition [Oliver02]

• Target:
  • Recognize complex human activities over a longer period of time ("context" in an office setting)
• Types of context (situations, activities, etc.):
  • Phone conversation
  • Face-to-face conversation
  • Presentation
  • Distant conversation
  • Others...
• Sensors:
  • Binaural microphones
  • USB camera
  • Keyboard and mouse

Hierarchical approach

• Problem with normal HMMs:
  • Lack of structure
  • Large parameter space
  • Overfitting on long sequences with little training data
    → Bad generalization
  • Fusion of various streams is possible, but multiplies the required parameters → need even more training data
• Solution:
  • Hierarchical (Layered) Hidden Markov Models (LHMMs)

Layered HMMs

[Figure: layered architecture; the output of the lower layer is the input to the higher layer]

• Activities: phone conversation, face-to-face conversation, presentation, distant conversation
• Classes: nobody present, one person, one active person, multiple people; music, silence, phone ring

Temporal Granularity

• HMMs at level L use sliding windows of T^L samples
• Data in the time window at level L is analyzed
  • Likelihoods are computed
  • The result is passed on to level L+1 as input
• Window length varies with the levels
  • The higher the level, the larger the time scale T^L
  • Higher levels model longer activities
  • The abstraction level increases on higher levels

Layered HMMs

• Lower-level HMMs are trained independently (separately for each stream), using the Baum-Welch algorithm
• Low-level HMMs recognize fine-grained context
  • Nobody present, one person, one active person, multiple people. Also: music, silence, phone ring
• Output of lower levels is passed to higher levels
• 2 approaches:
  • Maxbelief: only information from the most likely HMM is passed
  • Distributional: the full probability distribution over models is passed

Layered HMMs

• Maxbelief: Winner(t)

  $$A_{t-T}, \ldots, A_{t-2}, A_{t-1}, A_{t}$$

  T discrete symbols in {1,…,K}

• Distributional:

  $$\begin{pmatrix} L_1^{t-T} \\ L_2^{t-T} \\ \vdots \\ L_K^{t-T} \end{pmatrix}, \ldots, \begin{pmatrix} L_1^{t-1} \\ L_2^{t-1} \\ \vdots \\ L_K^{t-1} \end{pmatrix}, \begin{pmatrix} L_1^{t} \\ L_2^{t} \\ \vdots \\ L_K^{t} \end{pmatrix}$$

  Vector of K likelihoods for each time step
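The two ways of passing information upward can be sketched with a bank of K discrete-output HMMs evaluated on sliding windows. The sketch assumes already trained models given as (π, A, B) tuples, uses log-likelihoods from a scaled forward algorithm, and invents the toy models and the window settings purely for illustration; it is not the implementation from [Oliver02]:

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm: log P(obs | model) for a discrete-output HMM."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for o in obs[1:]:
        c = alpha.sum()
        log_p += np.log(c)
        alpha = ((alpha / c) @ A) * B[:, o]
    return log_p + np.log(alpha.sum())

def run_lower_layer(stream, hmms, T, step):
    """Evaluate all K HMMs on each window of T samples."""
    symbols, vectors = [], []
    for start in range(0, len(stream) - T + 1, step):
        window = stream[start:start + T]
        L = np.array([log_likelihood(window, *hmm) for hmm in hmms])
        symbols.append(int(np.argmax(L)))   # maxbelief: index of the winning HMM
        vectors.append(L)                   # distributional: all K log-likelihoods
    return symbols, np.array(vectors)

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
hmm_quiet = (pi, A, np.array([[0.9, 0.1], [0.8, 0.2]]))    # mostly emits symbol 0
hmm_active = (pi, A, np.array([[0.1, 0.9], [0.2, 0.8]]))   # mostly emits symbol 1

stream = [0, 0, 0, 0, 1, 1, 1, 1]
symbols, vectors = run_lower_layer(stream, [hmm_quiet, hmm_active], T=4, step=4)
```

The `symbols` list would feed a discrete-input HMM at the next level (maxbelief), while the rows of `vectors` would feed a continuous-input model (distributional).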

Advantages

• Have a smaller state space (fewer parameters) than comparable conventional HMMs
  • Less prone to overfitting than HMMs
  • Need little training data at each level
• Lower-level HMMs can be retrained separately
  → Adapt to new office settings
• More intuitive, structured representation
• Encodes the temporal structure of the activity modeling problem
• Difficulty: the time granularity of each step is defined manually (1 s, 5 s, ...)

Visual features

• Skin color density (over the whole image)
  (classification using skin/non-skin color histograms in HSV space)
• Motion density
  (image differences)
• Foreground pixel density
  (background subtraction using a learned background)
• Face pixel density
  (using a real-time face detector)

Comparison HMM, L-HMM

[Figure: two panels, "Single HMM" and "Hierarchical HMM" (likelihoods are those of the highest-level models)]

Illustration: per-frame normalized likelihoods of the models during real-time testing of different office activities

Room/Office Activities

Activity Recognition and Room-level Tracking in an Office Environment [Wojek et al. 2006]

• Activity recognition allows us to infer:
  • User's situation and availability
  • Interactions within groups
  • Can be used to produce a diary of each day
• Project goals
  • Detection of local events (e.g. somebody is entering a room, phone call, ...)
  • Fusion of those to detect global situations (e.g. meeting)
  • (Track people's locations across offices)
  • Use a lightweight feature set and simple equipment that works under varying conditions

Floor Layout / Sensor Setup

[Figure: floor plan; legible labels: Office B, Office D]

• Seven people in four offices (plus smart room)
• Sensors:
  • One camera per office
  • One omnidirectional microphone per office

Features

• Video features
  • Foreground
  • Optical flow
• Audio
  • Signal energy
  • Zero crossing rate
  • Pitch
• Uses a data-driven local feature model
  • Foreground is modeled as a GMM
  • Video features are calculated for each Gaussian
  • Data-driven way to find meaningful areas
  • Reduces dimensionality!

[Figure panels: Foreground, Learned FG model, Optical Flow]
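Two of the audio features listed above are simple per-frame statistics. A minimal sketch, assuming the signal has already been cut into short frames of samples; the mean-square definition of energy and the treatment of exact zeros are common conventions chosen here for illustration, and pitch estimation is left out:

```python
import numpy as np

def frame_energy(frame):
    """Mean squared amplitude of one audio frame."""
    frame = np.asarray(frame, dtype=float)
    return float(np.mean(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of neighboring sample pairs whose signs differ."""
    signs = np.sign(np.asarray(frame, dtype=float))
    signs[signs == 0] = 1                 # assumed convention: zeros count as positive
    return float(np.mean(signs[1:] != signs[:-1]))

noisy = [1.0, -1.0, 1.0, -1.0]            # alternating signal: crosses zero every sample
steady = [0.5, 0.5, 0.5, 0.5]             # constant signal: never crosses zero

features = [(frame_energy(f), zero_crossing_rate(f)) for f in (noisy, steady)]
```

High energy with a low zero crossing rate is typical of voiced speech, which is one reason these two features are often used together to separate talking from ambient noise.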

Foreground Detection

• Alpha-weighted difference images to detect foreground regions
  • Simple background model: [equation not preserved in this transcript]
  • Pixels are classified as foreground if their distance to the background exceeds m: [equation not preserved in this transcript]
  • Adaptation speed is set via alpha
  • Fast and robust
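The slide's own formulas are missing from this transcript. One common reading of an "alpha-weighted" background model is an exponential running average, sketched below under that assumption; the update rule, the absolute-difference distance and all names are illustrative and may differ from the formulas used in [Wojek et al. 2006]:

```python
import numpy as np

def update_background(background, frame, alpha):
    """Assumed model: B_t = (1 - alpha) * B_{t-1} + alpha * I_t."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(background, frame, m):
    """Pixels whose distance to the background exceeds the threshold m."""
    return np.abs(frame - background) > m

background = np.full((2, 3), 100.0)       # learned background intensity
frame = background.copy()
frame[0, 1] = 180.0                       # a bright object appears at one pixel

mask = foreground_mask(background, frame, m=30.0)
background = update_background(background, frame, alpha=0.1)
```

A larger alpha lets the background absorb changes faster (useful for lighting changes), at the cost of also absorbing people who stay still for a while.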

Example Foreground Segmentation

[Figure: example foreground segmentation]

Activity Recognition – Local feature model

[Figure: resulting Gaussians for significant image areas and their first three standard deviations]

Activity Recognition with Layered HMMs

• Idea
  • The first layer consists of two groups of HMMs (audio HMMs and video HMMs) to detect events
  • Higher-level HMMs are fed with the output probabilities of lower-level HMMs in order to detect situations
• Feature vector structure on the lowest level:
  • Video features: [not preserved in this transcript]
  • Audio features: [not preserved in this transcript]

Multi-Layer HMMs

• Multi-layered HMM approach
  • First layer to detect events
  • Higher-level HMMs to detect situations
• Examples for events
  • Somebody is sitting at a certain place (V)
  • Somebody is entering a room (V)
  • Somebody is leaving a room (V)
  • Somebody is talking vs. ambient noise (A)
• Examples for situations
  • Meeting with a visitor
  • Desk work
  • Discussion in an office
  • Nobody in office

[Figure labels: High-level HMMs, Low-level HMMs (A+V)]

Results (Office B, two persons)

• Training data: 4 full days
• Test data: 2 full days
• Both included:
  • Daylight
  • Artificial light (evening)
  • Cloudy skies
  • Sunlight
  • ...

Results (Office B, two persons)

• First level: [results not preserved in this transcript]
• Second level: [results not preserved in this transcript]

Video

Summary [Wojek et al.]

• The system allows
  • detecting events and situations in several offices
  • tracking colleagues on the floor (not explained here, see paper)
• Real-world data used
  • Recorded on seven days during working hours (tested on two days)
  • Data includes all kinds of illumination (sunlight, cloudy sky, artificial illumination at night, etc.)
• Useful
  • to provide a semantic description of what is going on (and where)
    • for example as a diary
  • to determine the availability of people

Activity Maps for Location-Aware Computing [Demirdjian02]

• Target:
  • Recognize basic activities in an office environment
• Method:
  • Estimation of activity based on "activity zones" and primitive features (position, height)
• Sensors, features:
  • Stereo cameras
  • Disparity, intensity images
  • Person trajectories + height in a plan view of the room

Automatic estimation of activity maps

• Activity zones are not defined manually
• Rather: automatic segmentation based on observed features
• Features:
  • From the tracker we get a 3D information history (x,y,h)
    → Calculate the feature f(x,y) = (h, v, v_lt)
  • h: person height from the range image
  • $v = \sqrt{v_x^2 + v_y^2}$: ground plane velocity
  • v_lt: average ground plane velocity over a certain time frame
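The feature vector f = (h, v, v_lt) can be derived from a tracked (x, y, h) history. The sketch below assumes a fixed sampling interval `dt`, finite differences for the velocity, and a trailing mean over `window` samples for v_lt; these choices and the function name are illustrative, since the slide does not specify them:

```python
import numpy as np

def activity_features(track, dt, window):
    """track: rows of (x, y, h) sampled every dt seconds. Returns rows of (h, v, v_lt)."""
    track = np.asarray(track, dtype=float)
    vel = np.diff(track[:, :2], axis=0) / dt        # (v_x, v_y) between samples
    v = np.hypot(vel[:, 0], vel[:, 1])              # v = sqrt(v_x^2 + v_y^2)
    v_lt = np.array([v[max(0, i - window + 1):i + 1].mean() for i in range(len(v))])
    h = track[1:, 2]                                # height at the later sample
    return np.column_stack([h, v, v_lt])

# A person walks 5 m in one second, then stops and sits down (height drops).
track = [(0.0, 0.0, 1.8), (3.0, 4.0, 1.8), (3.0, 4.0, 1.2)]
features = activity_features(track, dt=1.0, window=2)
```

The long-term average v_lt still reflects the recent walk after the person has stopped, which helps distinguish "just arrived" from "has been sitting for a while".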

Segmentation into Activity Zones

• Two steps:
  • 1) Cluster the features, independent of spatial information, into classes
  • 2) Group features from the same class that are close to each other into zones (eliminate zones that are too small)
• In the resulting map, one area may correspond to several overlapping activity zones

K-Means Clustering

• 1) Initialize K cluster centers randomly
• 2) Assign each data point to the closest cluster center
• 3) Recompute the cluster centers based on their respective data points
• 4) Repeat until terminated
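The four steps map directly onto a short loop. A minimal NumPy sketch, assuming Euclidean distance, initialization from randomly chosen data points, and "terminated" meaning that the centers no longer move; the toy data is made up:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]    # 1) random initial centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                          # 2) assign to closest center
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])                                                     # 3) recompute centers
        if np.allclose(new_centers, centers):                  # 4) stop when stable
            break
        centers = new_centers
    return labels, centers

# Two well-separated groups of 2-D points.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(X, k=2)
```

In the activity-map setting, the rows of `X` would be the (h, v, v_lt) feature vectors, clustered without using their (x, y) positions.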


2nd step: Region growing

• Select a feature from each class (seed)
• Find features from the same class that lie close to the seed, add them to the region
• When the region cannot be grown any more, select a new seed until all features from the class are used
• Creates spatial groupings of features belonging to the same class (remove regions with few features)
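The region-growing step can be sketched as a flood fill over feature positions. The sketch assumes that "close" means within a fixed Euclidean `radius` of a feature already in the region and that regions below `min_size` features are dropped; both parameters and the seed order are illustrative choices:

```python
import numpy as np

def grow_regions(points, labels, radius, min_size):
    """Group same-class features into spatially connected regions (zones)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    unused = set(range(len(points)))
    regions = []
    while unused:
        seed = min(unused)                        # next unused feature becomes the seed
        unused.remove(seed)
        region, frontier = [seed], [seed]
        while frontier:                           # grow until nothing nearby is left
            i = frontier.pop()
            near = [j for j in unused
                    if labels[j] == labels[seed]
                    and np.linalg.norm(points[j] - points[i]) <= radius]
            for j in near:
                unused.remove(j)
            region.extend(near)
            frontier.extend(near)
        if len(region) >= min_size:               # remove regions with few features
            regions.append((labels[seed].item(), sorted(region)))
    return regions

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (1, 1), (20, 20)]
labels = [0, 0, 0, 0, 0, 1, 0]                    # class of each feature (e.g. from k-means)
zones = grow_regions(points, labels, radius=1.5, min_size=2)
```

Because each class is grown separately, zones of different classes can overlap in the plan view.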


Detecting a person's activity zone

• Determine the person's position and feature vector f = (h, v, v_lt) through the stereo tracker
• Find the subset Z of zones lying close to the person's position
• Choose the correct zone from the subset based on the feature vector f
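The lookup can be sketched as a two-stage filter. The slide does not say how "close" or "correct" are decided, so the sketch assumes a distance threshold on zone centers and a nearest-neighbor match in feature space; the zone records with keys `name`, `center` and `feature` are a hypothetical data layout:

```python
import numpy as np

def detect_zone(position, f, zones, max_dist):
    """Pick the nearby zone whose typical feature vector best matches f."""
    position = np.asarray(position, dtype=float)
    f = np.asarray(f, dtype=float)
    # Subset Z: zones whose center lies within max_dist of the person.
    Z = [z for z in zones
         if np.linalg.norm(np.asarray(z["center"], dtype=float) - position) <= max_dist]
    if not Z:
        return None
    best = min(Z, key=lambda z: np.linalg.norm(np.asarray(z["feature"], dtype=float) - f))
    return best["name"]

# Hypothetical zones with typical (h, v, v_lt) values.
zones = [
    {"name": "walking", "center": (2.0, 2.0), "feature": (1.8, 1.0, 1.0)},
    {"name": "desk",    "center": (2.5, 2.0), "feature": (1.2, 0.0, 0.05)},
    {"name": "cabinet", "center": (8.0, 8.0), "feature": (1.8, 0.0, 0.1)},
]
zone = detect_zone(position=(2.2, 2.0), f=(1.25, 0.0, 0.0), zones=zones, max_dist=1.0)
```

The feature vector resolves overlaps: a seated, motionless person near the desk is assigned to the desk zone even though the walking zone covers the same floor area.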

Example: Two-person office

Zone  Category
1     Walking
2     Working Desk - User A
3     Working Desk - User B
4     Working Desk - User B (faster)
5     File cabinet

[Figure panels: feature classes after k-means; automatically determined activity zones]

Category names are human interpretation!

Summary

• Temporal templates to classify aerobic movements
  • MEI, MHI, scale-, translation- and rotation-invariant features
• Layered HMMs
  • Deduce high-level activities from low-level events
  • Reduces the state space and the amount of training data needed; helpful for modeling temporal granularities
• Unsupervised clustering of activities
  • Cluster activity zones (k-means, region growing)

References

[Bobick01] A. F. Bobick, J. Davis. The Recognition of Human Movement Using Temporal Templates. IEEE PAMI, Vol. 23, No. 3, March 2001

[Oliver02] N. Oliver, E. Horvitz, A. Garg. Layered Representations for Human Activity Recognition. Proceedings of the 4th IEEE International Conference on Multimodal Interfaces (ICMI), 2002

[Wojek et al. 2006] C. Wojek, K. Nickel, R. Stiefelhagen. Activity Recognition and Room Level Tracking in an Office Environment. IEEE Int. Conference on Multisensor Fusion and Integration for Intelligent Systems (MFI06), September 2006

[Demirdjian02] D. Demirdjian, K. Tollmar, K. Koile, N. Checka, T. Darrell. Activity Maps for Location-Aware Computing. Proceedings of the Sixth IEEE Workshop on Applications of Computer Vision, 2002
