
cvpr2011: human activity recognition - part 5: description based

Date post: 15-Jun-2015
Frontiers of Human Activity Analysis J. K. Aggarwal Michael S. Ryoo Kris M. Kitani
Transcript
Page 1: cvpr2011: human activity recognition - part 5: description based

Frontiers of Human Activity Analysis

J. K. Aggarwal, Michael S. Ryoo, Kris M. Kitani

Page 2: cvpr2011: human activity recognition - part 5: description based

Description-based approaches

2000s

Page 3: cvpr2011: human activity recognition - part 5: description based

Approach paradigm

Description-based approach

We represent the structure of activities and recognize them using semantic matching.

Hand shake = “two persons do shake-action (stretch, stay stretched, withdraw) simultaneously, while touching”.

Recognition is performed by finding observations that satisfy the definition.

[Figure: humans' conceptual knowledge of human activity provides the human activity representation; the input sequence goes through low-level processing and atomic action recognition (e.g. arm stretch), followed by high-level activity recognition.]

Page 4: cvpr2011: human activity recognition - part 5: description based

Comparisons

Compared properties: levels of hierarchy; complex temporal relations; complex logical concatenations; recognition of recursive activities; handling of imperfect low-levels.

Each row lists the levels of hierarchy first, then the remaining entries in slide order.

Statistical: limited; (depends on data amount)

Syntactic: unlimited; √; √

Siskind 2001: unlimited; a sub-event participates only once

Hongeng et al. 2004: limited (3-levels); √; √

Vu et al. 2003: unlimited; √; conjunctions only

Ryoo and Aggarwal 2009: unlimited; √; √; √; √

Gupta et al. 2009: limited (2-levels); √; network form only; √

Page 5: cvpr2011: human activity recognition - part 5: description based

Description-based approaches

Human interactions

2000s

Page 6: cvpr2011: human activity recognition - part 5: description based

Recognition of human interactions

Ryoo and Aggarwal, CVPR 2006

Layers: input sequences → body-part layer → pose layer → gesture layer → semantic layer

Body-part feature: numerical status of a body part.

Pose: abstract status of a body part.

Gesture: elementary movement of a person.

Interaction: e.g. one person “pushed” by another person in time interval <4, 20>.

Example poses (frame : pose), one row per person:

1 : Arm withdrawn; 8 : Arm somewhat stretched; 13 : Arm fully stretched

1 : Arm withdrawn; 8 : Arm withdrawn; 13 : Arm withdrawn

Example gestures (<start, end> : gesture), one row per person:

<1,20> : Facing right; <1,20> : Arm staying; <1,20> : Leg staying

<1,20> : Facing left; <4,20> : Arm stretching; <1,20> : Leg staying

Page 7: cvpr2011: human activity recognition - part 5: description based

Atomic actions

Operation triplets <agent, motion, target>

Gesture together with subject and object information.

Unit human activity.

Computed based on gestures.

Example: person1 stretches his/her arm

<p1's arm, stretch, null>

Time intervals

Example: time intervals detected for the pointing action

Sequences of poses:

P1:Head :222222222222222222

P1:ArmV :322021221111122333

P1:ArmH :100022222222221100

Time intervals of operation triplets (along the time axis):

<p1's arm, stretch, p2>

<p1's arm, stay stretched, p2>

<p1's arm, withdraw, null>
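A minimal Python sketch of how time intervals could be read off a per-frame pose sequence such as the P1:ArmH row. The run-length grouping, the helper name `pose_intervals`, zero-based frame indices, and the reading of label 2 as "arm stretched" are all illustrative assumptions; the slides do not specify the label coding or the detection method.

```python
from itertools import groupby

def pose_intervals(labels: str, target: str):
    """Return inclusive (start, end) frame intervals where the label equals target."""
    intervals = []
    frame = 0
    for label, run in groupby(labels):
        length = len(list(run))
        if label == target:
            intervals.append((frame, frame + length - 1))
        frame += length
    return intervals

# P1:ArmH row from the slide; treating "2" as "stretched" is an assumption.
arm_h = "100022222222221100"
stretched = pose_intervals(arm_h, "2")

# Hypothetical pairing of an operation triplet <agent, motion, target>
# with each detected time interval.
triplets = [(("p1's arm", "stay stretched", "p2"), iv) for iv in stretched]
print(triplets)
```

Each maximal run of the target label becomes one interval, which is the form the later temporal predicates operate on.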

Page 8: cvpr2011: human activity recognition - part 5: description based

Semantic layer recognition

Punching is a sequence of hand stretch and withdrawal.

Video observations: time intervals of gestures (along the time axis)

<p2's arm, stretch, p1>

<p2's arm, stay stretched, p1>

<p2's arm, withdraw, null>

<p2's arm, stay withdrawn, null>

<p2's head, face left, null>

<p1's arm, withdraw, null>

… …

Are the observations similar to the machine-understandable representation of Punching?

Page 9: cvpr2011: human activity recognition - part 5: description based

Human activity representation

Semantics

Knowledge on the structure of an activity.

Example: punching is a sequence of hand stretch and withdrawal.

Time intervals

Allen's temporal predicates

Syntax

Rules to construct a formal representation.

Organizes a vocabulary to describe the activities' structure.

Context-free grammar (CFG)

Conceptual/verbal description:

x = arm_stretch(p1)

y = arm_withdraw(p1)

this = Punching_action(p1)

Machine-understandable language (obtained through the CFG):

Punching_action(i) = (
    list( def('x', Arm_Stretch(i)),
          def('y', Arm_Withdraw(i)) ),
    and( meets('x', 'y'),
         and( starts('x', 'this'),
              finishes('y', 'this') ) ) );
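The Punching_action definition can be sketched in Python as a search for interval pairs that satisfy the temporal predicates. This is only an illustration: the `Interval` type, the function names, and the strict Allen-style definitions of `meets`, `starts`, and `finishes` are assumptions, and the tutorial's own matching procedure may differ.

```python
from typing import List, NamedTuple

class Interval(NamedTuple):
    start: int
    end: int

def meets(a: Interval, b: Interval) -> bool:
    """a ends exactly where b begins."""
    return a.end == b.start

def starts(a: Interval, b: Interval) -> bool:
    """a and b begin together and a ends first."""
    return a.start == b.start and a.end < b.end

def finishes(a: Interval, b: Interval) -> bool:
    """a and b end together and a begins later."""
    return a.end == b.end and a.start > b.start

def match_punching(stretches: List[Interval], withdraws: List[Interval]) -> List[Interval]:
    """Return every interval 'this' with meets(x, y), starts(x, this), finishes(y, this)."""
    results = []
    for x in stretches:          # def('x', Arm_Stretch(i))
        for y in withdraws:      # def('y', Arm_Withdraw(i))
            if meets(x, y):
                this = Interval(x.start, y.end)
                if starts(x, this) and finishes(y, this):
                    results.append(this)
    return results

# Hypothetical detections: one stretch, two withdrawals.
punches = match_punching([Interval(4, 9)], [Interval(9, 13), Interval(15, 18)])
print(punches)
```

Only the withdrawal that begins exactly when the stretch ends is accepted, so a single punching interval spanning both sub-events is returned.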

Page 10: cvpr2011: human activity recognition - part 5: description based

Hierarchical activity representation

Representation of the 'shake-hands' interaction

Hand shake = “two persons do shake-action (stretch, stay stretched, withdraw) simultaneously, while touching”.

Description-based representation (CFG syntax):

ShakeHandsInteractions(i, j) = (
    list( def('x', ShakeHandsActions(i)),
    list( def('y', ShakeHandsActions(j)),
          def('z', TouchingInteraction(i, j))) ),
    and( and( during('z', 'x'), during('z', 'y')),
         and( starts('z', 'this'), finishes('z', 'this')))
);

ShakeHandsActions(i) = (
    list( def('x', Arm_Stretch(i)),
    list( def('y', Arm_Stay_Stretched(i)),
          def('z', Arm_Withdraw(i))) ),
    and( and( meets('x', 'y'), meets('y', 'z')),
         and( starts('x', 'this'), finishes('z', 'this')))
);

TouchingInteraction(i, j) = (null, touch(i, j, 0));

Page 11: cvpr2011: human activity recognition - part 5: description based

Hierarchical recognition algorithm

Recognition process of the 'shake-hands' interaction.

Page 12: cvpr2011: human activity recognition - part 5: description based

Continued and recursive activities

Interaction 'fighting'

Composed of multiple negative interactions

Punching + kicking + pushing + punching + …

An iterative approach is taken.

[Figure: recursive structure of Fighting Interaction(i, j). Base case: a Negative Interaction(i, j). Recursive case: a Fighting Interaction(i, j) combined with a further Negative Interaction(i, j).]
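One way to picture the iterative treatment of a recursive activity is to chain time-ordered negative interactions into longer "fighting" intervals. The sketch below is an assumption-laden illustration: the function name, the `max_gap` tolerance between consecutive interactions, and the choice to treat a single negative interaction as the base case are all hypothetical and not taken from the slides.

```python
def recognize_fighting(negatives, max_gap=5):
    """Chain (start, end) intervals of negative interactions into fighting intervals.

    Base case: a negative interaction opens a new fighting interval.
    Recursive step, unrolled as a loop: fighting + another negative interaction.
    """
    fights = []
    for start, end in sorted(negatives):
        if fights and start - fights[-1][1] <= max_gap:
            # Extend the current fighting interval with this negative interaction.
            fights[-1] = (fights[-1][0], max(fights[-1][1], end))
        else:
            # Start a new fighting interval.
            fights.append((start, end))
    return fights

# Hypothetical punch, kick, and a much later push.
fights = recognize_fighting([(0, 10), (12, 20), (40, 50)], max_gap=5)
print(fights)
```

Unrolling the recursion as a loop avoids enumerating every nesting depth of Fighting = Fighting + negative interaction.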

Page 13: cvpr2011: human activity recognition - part 5: description based

Experiments - Simple interactions

Recognized 8 types of simple interactions, the same set recognized in Park and Aggarwal, 2004 (approach, depart, point, shake-hands, hug, punch, kick, and push).

Videos containing sequences of interactions were taken (continuous executions).

Interactions are described in a more detailed and formal way, resulting in better recognition accuracy.

Page 14: cvpr2011: human activity recognition - part 5: description based

Example Experiment - Fighting

[Figure: input video and processed video, shown with the recognized poses and the recognized gestures and activities.]

Page 15: cvpr2011: human activity recognition - part 5: description based

Past-Now-Future networks

Pinhanez and Bobick 1998

PNF networks represent the temporal structure of an activity.

Example domain: kitchen activities.

[Pinhanez, C. S. and Bobick, A. F., Human action detection using PNF propagation of temporal constraints. CVPR 1998]

Page 16: cvpr2011: human activity recognition - part 5: description based

Event logic

Siskind 2001

Logical concatenations of predicates

Time intervals?

[Siskind, J. M., Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. Journal of Artificial Intelligence Research (JAIR) 15, 2001]

Page 17: cvpr2011: human activity recognition - part 5: description based

Representation languages

Nevatia, Zhao, and Hongeng 2003: VERL language

Vu, Bremond, and Thonnat 2003: similar to Nevatia et al. 2003

Recursive? Uncertainties?

Page 18: cvpr2011: human activity recognition - part 5: description based

Stochastic approaches

Limitations of the conventional description-based approaches

Uncertainties? → stochastic recognition

Probabilistic framework needed

[Ryoo and Aggarwal, IJCV 2009]

[Tran and Davis, ECCV 2008] - Markov logic networks (MLN)

Page 19: cvpr2011: human activity recognition - part 5: description based

Hierarchical matching algorithm

Recognition process tree of 'steal(p1, s1, p2)'

Steal(p1, s1, p2)
    Parameter objects: (person p1, suitcase s1, person p2)
    Sub-events: Carry(p1, s1) meets Stay(s1) meets Carry(p2, s1)

Carry(p1, s1)
    Parameter objects: (person p1, suitcase s1)
    Sub-events: MoveR(p1) equals MoveR(s1)

Carry(p2, s1)
    Parameter objects: (person p2, suitcase s1)
    Sub-events: MoveL(p2) equals MoveL(s1)

The leaves MoveR(p1), MoveR(s1), Stay(s1), MoveL(p2), and MoveL(s1) are recognition results from the object and motion layers.

Probabilities shown for the five leaf detections:

P(A1 | O1, M1) = 0.8, P(A2 | O2, M2) = 0.7, P(A3 | O3, M3) = 0.5, P(A4 | O4, M4) = 0.9, P(A5 | O5, M5) = 0.9

Probability shown for the root:

P(S | O1, M1, …) = 0.8
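The recognition tree for steal(p1, s1, p2) can be sketched as a bottom-up match over detected time intervals. This is a deterministic simplification that ignores the probabilities on the slide; the dictionary keys, helper names, and exact-equality tests are assumptions made for the example only.

```python
def equals(a, b):
    """Both intervals cover exactly the same frames."""
    return a == b

def meets(a, b):
    """a ends exactly where b begins."""
    return a[1] == b[0]

def carry(person_move, object_move):
    """Carry holds over the interval where person and object motions coincide."""
    return person_move if equals(person_move, object_move) else None

def steal(detections):
    """Match Steal = Carry(p1, s1) meets Stay(s1) meets Carry(p2, s1)."""
    c1 = carry(detections["MoveR(p1)"], detections["MoveR(s1)"])
    c2 = carry(detections["MoveL(p2)"], detections["MoveL(s1)"])
    stay = detections["Stay(s1)"]
    if c1 and c2 and meets(c1, stay) and meets(stay, c2):
        return (c1[0], c2[1])
    return None

# Hypothetical (start, end) detections from the object and motion layers.
detections = {
    "MoveR(p1)": (0, 40), "MoveR(s1)": (0, 40),
    "Stay(s1)": (40, 70),
    "MoveL(p2)": (70, 120), "MoveL(s1)": (70, 120),
}
print(steal(detections))
```

Each level of the tree only consumes intervals produced by the level below it, which is what lets the same matching step be reused across the hierarchy.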

Page 20: cvpr2011: human activity recognition - part 5: description based

Probabilistic recognition

Probability of the activity given the observation

[Equation not recovered from the slide. Its labelled terms are the gesture detection confidence and the description-based structural similarity.]

Page 21: cvpr2011: human activity recognition - part 5: description based

Experiments

Recognized the following six types of interactions. Each activity was tested with at least 10 sequences.

Carrying a box, leaving a box, placing a box into a trash bin.

Carrying a suitcase, leaving a suitcase, stealing the suitcase.

Object and motion layers were trained with 5 sequences.

[Figure: timeline of detected time intervals for Carry(Person1, SuitCase1), Stay(SuitCase1), Carry(Person2, SuitCase1), and Steal(Person1, SuitCase1, Person2).]

Page 22: cvpr2011: human activity recognition - part 5: description based

Experiments

Example: a person placing a box into a trash bin

[Figure: timeline of detected time intervals for Move(Person1, right), Move(Box1, right), Move(Person1, left), Move(Box1, down), Carry(Person1, Box1), and Trash(Person1, Box1, TrashBin1).]

Page 23: cvpr2011: human activity recognition - part 5: description based

Experiments - Performance

Recognition accuracy (true positives): compared with a multi-object version of previous works.

False positive rates are almost zero for all activities.

[Figure: bar chart of recognition accuracy, vertical axis from 0 to 1, for each activity and the total.]

Activity complexity:

Carry-Suitcase: 2 atomics, 1 spatial

Leave-Suitcase: 3 atomics, 1 spatial

Steal-Suitcase: 5 atomics, 2 spatials

Carry-Box: 2 atomics, 1 spatial

Leave-Box: 3 atomics, 1 spatial

Trash-Box: 3 atomics, 3 spatials

Page 24: cvpr2011: human activity recognition - part 5: description based

Advantages

Ability to represent and recognize an activity composed of concurrent sub-events. Example: “touching occurred during pushing”.

Ability to represent and recognize 'recursive activities'. Example: Fighting = Fighting + another negative interaction.

Less data required for training. The 'structure of activities' is encoded based on human knowledge.

High recognition accuracy?

[Figure: time intervals for this = Pushing_interactions(p1, p2), with i = Arm_Stretch(p1), j = Arm_Stay_Stretched(p1), k = Touch(p1, p2), l = Depart(p2, p1).]

Page 25: cvpr2011: human activity recognition - part 5: description based

Description-based approaches

Group activities

2008, 2009

Page 26: cvpr2011: human activity recognition - part 5: description based

Group activity

Ryoo and Aggarwal, CVPR-SIG 09, IJCV 11

Events performed by groups

Various types of complex activities

Group-person interaction

Group-group interaction

Uncertain nature

Varying number of participants

Dynamic spatial relations

Examples:

Group Assault (group vs. person)

Group Stealing (group vs. group): a person in a red shirt is taking the laptop on the table while the others are talking.

Page 27: cvpr2011: human activity recognition - part 5: description based

Representation

Group stealing

A formal representation has three parts:

1. Member variables

2. Time intervals

3. Predicates

Formal representation of group stealing:

1. ∃ a in Thieves, ∀ b in Owners, ∃ c in Thieves

2. def(t1, TakeObject(a)), def(t2, Distract(c, b))

3. equals(t1, this), during(t1, t2)

[Figure: thieves, owners, and the object, with time intervals of the activities of individual members: Distract(r1, g1), Distract(r2, g1), Distract(r3, g2), and TakeObject(r4), together forming Stealing(R, G). t1 equals this, and t1 occurs during t2.]
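The quantified member variables of the group-stealing representation map naturally onto nested `any`/`all` checks. The following Python sketch is illustrative only: the data layout (dictionaries of detected intervals), the inclusive `during` test, and the function name are assumptions, and it ignores the uncertainty handled by the stochastic formulation.

```python
def during(a, b):
    """True if interval a lies within interval b (inclusive bounds are an assumption)."""
    return b[0] <= a[0] and a[1] <= b[1]

def match_group_stealing(thieves, owners, take_object, distract):
    """take_object: thief -> (start, end); distract: (thief, owner) -> (start, end).

    Returns the interval of 'this' (equal to t1), or None if no match exists.
    """
    for a in thieves:                                   # ∃ a in Thieves
        t1 = take_object.get(a)                         # def(t1, TakeObject(a))
        if t1 is None:
            continue
        # ∀ b in Owners, ∃ c in Thieves with t2 = Distract(c, b) and during(t1, t2)
        if all(
            any((c, b) in distract and during(t1, distract[(c, b)]) for c in thieves)
            for b in owners
        ):
            return t1                                   # equals(t1, this)
    return None

# Hypothetical detections mirroring the figure's labels r1..r4 and g1, g2.
thieves, owners = ["r1", "r2", "r3", "r4"], ["g1", "g2"]
distract = {("r1", "g1"): (0, 100), ("r2", "g1"): (10, 90), ("r3", "g2"): (5, 95)}
take_object = {"r4": (30, 60)}
print(match_group_stealing(thieves, owners, take_object, distract))
```

Because every owner must be distracted by at least one thief for the whole take-object interval, removing the only distraction of g2 makes the match fail.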

Page 28: cvpr2011: human activity recognition - part 5: description based

Recognition overview

3 key components: who, what, and when.

Recognition:

Obtain a pool of group member candidates with non-zero probability.

Not many persons perform sub-events.

\[ M^{*} = \arg\max_{M} P\big(G(M)^{t} \mid O\big) \]

NP-hard. Approximation required.

[Figure: persons labelled with the sub-events they perform (Distract, Take, Stand); the system generates a pool of member candidates.]

Page 29: cvpr2011: human activity recognition - part 5: description based

Temporal constraints

Hierarchical temporal constraint matching.

Member variables:

∃a in Thieves, ∀b in Owners, ∃c in Thieves

[Figure: matching tree for Steal(T, O) over the video inputs, with sub-events Approach(c, b), Distract(c, b), and TakeObject(a) linked by 'before' and 'during' constraints.]

Page 30: cvpr2011: human activity recognition - part 5: description based

Group candidates

Among possible groupings, find the set of group members that maximizes the overall probability.

Bayesian formulation

\[ P(G^{t} \mid O) = \max_{M} P\big(G(M)^{t} \mid O\big) \propto \max_{M} \pi_{G}(M) \]

where

\[ \pi_{G}(M) = P\big(O \mid G(M)^{t}\big)\, P\big(G(M)^{t}\big), \qquad M = \{ m_{1}, m_{2}, \ldots, m_{|M|} \} \]

Page 31: cvpr2011: human activity recognition - part 5: description based

Bayesian formulation

Ci: persons performing the i-th sub-event.

Essential and anti-essential relations: Ki, Li

[Figure: four quantifier cases for a pair of member variables, each drawn over members 1 to 6. Case 1: ∃∃. Case 2: ∀∃. Case 3: ∃∀. Case 4: ∀∀.]

[Equations not recovered from the slide: a decomposition of P(O | G(M)^t) over the sub-event time intervals S_1^t, ..., S_n^t, with terms labelled “represented relations”, “represented sub-events”, and “structural similarity”, the last involving the sets K_i, L_i, and C_i.]

Page 32: cvpr2011: human activity recognition - part 5: description based

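Markov chain Monte Carlo

MCMC-based probability estimation.

Provides a set of samples from the distribution.

Models the probability distribution.

Metropolis-Hastings algorithm

Actions:

Add: \( M' = M_{t-1} \cup \{m\} \), where \( m \in C_{i} \)

Remove: \( M' = M_{t-1} \setminus \{m\} \)

Acceptance, with \( M = M_{t-1} \):

\[ a = \frac{\pi_{G}(M')\, q(M', M)}{\pi_{G}(M)\, q(M, M')}, \qquad P(M_{t-1}, M') = \min(1, a) \]

[Figure: a sequence of Add and Remove moves over the group member candidates.]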
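A minimal Metropolis-Hastings sketch with Add and Remove moves over candidate member sets. Everything concrete here is assumed for illustration: the toggle proposal (pick one candidate uniformly, add it if absent, remove it if present), the toy score `toy_pi` standing in for the slide's π_G, and the function and variable names. With this proposal the forward and backward proposal probabilities are equal, so q cancels in the acceptance ratio a.

```python
import math
import random

def mh_group_search(candidates, pi, steps=2000, seed=0):
    """Sample member sets with Add/Remove moves and return the best-scoring set seen."""
    rng = random.Random(seed)
    current = frozenset()
    best, best_score = current, pi(current)
    for _ in range(steps):
        m = rng.choice(candidates)
        # Toggle proposal: Remove m if present, otherwise Add m.
        proposal = current - {m} if m in current else current | {m}
        p_cur, p_new = pi(current), pi(proposal)
        # Symmetric proposal, so a reduces to the ratio of the two scores.
        a = p_new / p_cur if p_cur > 0 else 1.0
        if rng.random() < min(1.0, a):
            current = proposal
        score = pi(current)
        if score > best_score:
            best, best_score = current, score
    return best

# Toy target: the score decays with the number of members that differ
# from a hypothetical true group {"r1", "r2"}.
candidates = ["r1", "r2", "p1", "p2"]
truth = frozenset({"r1", "r2"})

def toy_pi(members):
    return math.exp(-2.0 * len(members ^ truth))

best = mh_group_search(candidates, toy_pi)
print(sorted(best))
```

Tracking the best-scoring sample is one simple way to turn the chain into an approximate maximizer over M; other uses of the samples are possible.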

Page 33: cvpr2011: human activity recognition - part 5: description based

Experimental setting

We tested 45 sequences of 8 activities, at 320×240 resolution and 10 fps.

CCTV videos downloaded from YouTube: group stealing in Malaysia and group arresting in the UK.

Videos that we took with 10 participants in various environments: a group of people carrying a large object, and a group of people assaulting a person.

Videos of real human activities

Page 34: cvpr2011: human activity recognition - part 5: description based

Experiments - stealing

Group stealing

One of the thieves steals a laptop while the other thieves distract the shop owner.

[Figure labels: laptop; thieves; owners.]

Page 35: cvpr2011: human activity recognition - part 5: description based

Experiments - arresting

Group arresting

A group of policemen arresting a group of suspicious persons.

[Figure labels: color histogram; pedestrians; policemen; criminal candidates.]

Page 36: cvpr2011: human activity recognition - part 5: description based

Experiments – group assault

Highly stochastic

There may or may not be attackers who are guarding the area or just watching.

10 videos.

Page 38: cvpr2011: human activity recognition - part 5: description based

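Experimental results

Recognition accuracy

False positive rates are almost 0 because of the detailed representations.

[Figure: bar chart of recognition accuracy, vertical axis from 0 to 1, comparing three systems (previous, deterministic, stochastic) on Move G, Carry G, CarryCmd GP, Fight IG, Fight GG, Assault GG, Steal GG, Arrest GG, and the total. Each activity is annotated with its quantifier pattern; the patterns shown are ∀, ∃∀, ∀∃, ∃∃, ∃∀∃, ∀∃, ∃∀, and ∀∃∃.]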

Page 39: cvpr2011: human activity recognition - part 5: description based

Spatio-Temporal Relationship Match

2009

Page 40: cvpr2011: human activity recognition - part 5: description based

Description-based vs. Space-time

Description-based

High-level activities

Hierarchical

Semantic structures

Difficult to cope with noise

Space-time

Reliable under noise

Difficult to model complex activities

Miss semantic structures

Page 41: cvpr2011: human activity recognition - part 5: description based

Space-time approaches

Video classification

Each video is represented as a histogram

Spatio-temporal features: Laptev 04, Dollar et al. 05

Limitation:

Unable to model complex activities

[Figure: example videos of pushing, hugging, shaking hands, and punching along the time axis.]

Page 42: cvpr2011: human activity recognition - part 5: description based

Spatio-temporal relations (STRs)

Ryoo and Aggarwal, ICCV 2009

[Figure: two videos drawn as XYT volumes containing features labelled 5, 12, 17, and 35, each mapped to its set of feature relations.]

Video 1: before(17, 12), overlaps(12, 35), overlaps(5, 17), ...

Video 2: overlaps(12, 35), before(17, 12), equals(5, 17), ...

Page 43: cvpr2011: human activity recognition - part 5: description based

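Histogram of STRs

[Figure: the two XYT volumes with features labelled 5, 12, 17, and 35, and a histogram built from each video's relations.]

Video 1: before(17, 12), overlaps(12, 35), overlaps(5, 17), ...

Video 2: overlaps(12, 35), before(17, 12), equals(5, 17), ...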

Page 44: cvpr2011: human activity recognition - part 5: description based

STR-match learning

Supervised learning

Videos with activity labels are provided.

[Figure: feature space of relationship histograms, with labelled training videos (shaking hands, hugging, pushing, punching) and an unlabelled test video marked “?”.]

Page 45: cvpr2011: human activity recognition - part 5: description based

STR equations

STR match considers distributions of pairwise relationships among features.

Histogram construction

STR distance:
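Since the slide's own equations are not reproduced in this transcript, the following is only a hedged sketch of the idea: count pairwise temporal relations between features into a histogram, then compare two histograms. The coarse relation set, the omission of spatial relations, the function names, and the use of histogram intersection as the comparison are all assumptions, not the formulas of the cited ICCV 2009 work.

```python
from collections import Counter
from itertools import permutations

def temporal_relation(a, b):
    """Classify (start, end) intervals a, b into a coarse relation, or None."""
    if a == b:
        return "equals"
    if a[1] < b[0]:
        return "before"
    if a[0] < b[0] <= a[1] < b[1]:
        return "overlaps"
    if b[0] <= a[0] and a[1] <= b[1]:
        return "during"
    return None

def str_histogram(features):
    """features: list of (codeword, (start, end)). Counts (word_a, word_b, relation)."""
    hist = Counter()
    for (wa, ia), (wb, ib) in permutations(features, 2):
        rel = temporal_relation(ia, ib)
        if rel is not None:
            hist[(wa, wb, rel)] += 1
    return hist

def intersection_similarity(h1, h2):
    """Histogram intersection: sum of bin-wise minima over shared bins."""
    return sum(min(h1[k], h2[k]) for k in h1.keys() & h2.keys())

# Hypothetical features using the codeword labels from the slides.
video = [(17, (0, 3)), (12, (5, 9)), (35, (7, 12)), (5, (20, 24))]
hist = str_histogram(video)
print(hist[(17, 12, "before")], hist[(12, 35, "overlaps")])
```

Keying each bin by the two codewords and the relation keeps the structure "which feature type relates to which, and how", which a plain bag-of-features histogram discards.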

Page 46: cvpr2011: human activity recognition - part 5: description based

STR-match activity detection

Must detect the starting time and ending time.

Models the starting XYT location of an activity.

Each feature pair in a matching training video makes a vote.

[Figure: a training video with its original starting point (vtrstart) and a test video with the expected starting point, expected_starting(vtest); both are drawn as XYT volumes with features labelled 5, 12, 17, and 35.]
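The voting idea can be sketched for the time axis alone. This simplification is an assumption on several counts: it votes with single matched features rather than feature pairs, it ignores the X and Y coordinates, and the names `expected_starting_time` and `bin_size` are hypothetical.

```python
from collections import Counter

def expected_starting_time(matches, train_start, bin_size=5):
    """Vote for the activity start time in a test video.

    matches: list of (t_test, t_train) times of features matched between
    the test video and a training video whose activity starts at train_start.
    """
    votes = Counter()
    for t_test, t_train in matches:
        offset = t_train - train_start           # feature time relative to the start
        votes[(t_test - offset) // bin_size] += 1
    best_bin, _ = votes.most_common(1)[0]
    return best_bin * bin_size                   # lower edge of the winning bin

# Hypothetical matches: three agree on a start near frame 110, one is an outlier.
matches = [(112, 12), (130, 30), (151, 50), (90, 45)]
print(expected_starting_time(matches, train_start=10))
```

Binning the votes makes the estimate tolerant of small timing differences and of a few mismatched features.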

Page 47: cvpr2011: human activity recognition - part 5: description based

Hierarchical recognition

Atomic action detections as new features

Localization ability enables hierarchical recognition

Page 48: cvpr2011: human activity recognition - part 5: description based


Experiments

KTH dataset

Public dataset composed of simple actions

Walking, jogging, running, waving, …

Page 49: cvpr2011: human activity recognition - part 5: description based


Experiments: high-level activities

High-level human activity detection results

Changing backgrounds, lighting conditions, …

Page 50: cvpr2011: human activity recognition - part 5: description based


STR-match summary

Detection from continuous videos

Localization using voting-based method

Noisy observations

Different backgrounds/lightings

Uncertainties

Human-human interactions

Hierarchical recognition

Future work

Hierarchy learning algorithm

Page 51: cvpr2011: human activity recognition - part 5: description based

Description-based: References

Allen, J. F. and Ferguson, G., Actions and events in interval temporal logic. Journal of Logic and Computation, 1994.

Pinhanez, C. S. and Bobick, A. F., Human action detection using PNF propagation of temporal constraints. CVPR 1998.

Siskind, J. M., Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. Journal of Artificial Intelligence Research (JAIR) 15, 2001.

Nevatia, R., Zhao, T., and Hongeng, S., Hierarchical language-based representation of events in video streams. IEEE Workshop on Event Mining 2003.

Vu, V.-T., Bremond, F., and Thonnat, M., Automatic video interpretation: A novel algorithm for temporal scenario recognition. IJCAI 2003.

Ryoo, M. S. and Aggarwal, J. K., Recognition of composite human activities through context-free grammar based representation. CVPR 2006.

Ryoo, M. S. and Aggarwal, J. K., Hierarchical recognition of human activities interacting with objects. CVPR-SLAM 2007.

Tran, S. D. and Davis, L. S., Event modeling and recognition using Markov logic networks. ECCV 2008.

Ryoo, M. S. and Aggarwal, J. K., Semantic representation and recognition of continued and recursive human activities. IJCV, 2009.

Ryoo, M. S. and Aggarwal, J. K., Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. ICCV 2009.

Ryoo, M. S. and Aggarwal, J. K., Stochastic representation and recognition of high-level group activities. IJCV, 2010.

