
Action Recognition Results

CMU MoCap dataset (multi-view)
• Projected tracks for 13 joints, 12 action classes
• Simulated noise in joint tracking; six virtual cameras (cam1 to cam6)

Cross-view recognition accuracy (%). Rows: training views; columns: test views.

            cam1   cam2   cam3   cam4   cam5   cam6   All
cam1        92.1   89.0   76.2   71.3   73.2   84.8   81.1
cam2        87.2   92.7   83.5   72.6   64.6   78.7   79.9
cam3        78.7   83.5   89.0   90.9   67.7   61.0   78.5
cam4        78.0   75.6   88.4   90.9   72.6   63.4   78.2
cam5        81.1   73.8   76.8   83.5   95.7   80.5   81.9
cam6        86.0   88.4   73.8   76.2   78.0   91.5   82.3
All         90.9   90.2   87.8   90.9   92.7   90.9   90.5

Accuracy per action class (%)

bend          80.6
cartwheels    95.2
drink          0.0
fjump         96.8
flystroke     29.2
golf         100.0
jjack         97.2
jump          64.6
kick          96.3
run          100.0
walk          99.6
walkturn      68.8
All           90.5

Weizmann dataset (single view)
• 9 classes of actions, performed by 9 actors.
• NNC recognition accuracy 95.3% with SSM-pos and 94.6% with SSM-of-ofx-ofy-hog, compared to 92.6% [Ali et al. ICCV’07]

IXMAS dataset (multi-view)
• Dataset: 5 different views, 11 action classes, performed 3 times by 10 actors.

[Figure: “check watch” action and “punch” action; panels labeled camera 1 to camera 5]

View Invariance Properties

• SSM “images” are stable under view changes

Experimental validation with dense view sampling
• Compute gradient orientation at each SSM point
• Estimate per-point orientation variance over views

[Figure: per-point standard deviation of SSM gradient orientation over views. “bend” action: average std 17.4°; “kick” action: average std 20.93°]
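The validation procedure above can be sketched in plain Python. This is only an illustration under stated assumptions: the poster does not say how the variance is computed, so the sketch uses the circular standard deviation, finite differences for the gradient, and SSMs already resampled to a common size. The function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def gradient_orientation(ssm: Matrix) -> List[List[float]]:
    """Gradient orientation (radians) at each SSM point.

    Uses central differences in the interior and one-sided differences
    at the borders. Requires a square matrix of size >= 2.
    """
    n = len(ssm)
    angles = []
    for i in range(n):
        i0, i1 = max(i - 1, 0), min(i + 1, n - 1)
        row = []
        for j in range(n):
            j0, j1 = max(j - 1, 0), min(j + 1, n - 1)
            gy = (ssm[i1][j] - ssm[i0][j]) / (i1 - i0)
            gx = (ssm[i][j1] - ssm[i][j0]) / (j1 - j0)
            row.append(math.atan2(gy, gx))
        angles.append(row)
    return angles


def orientation_std_over_views(ssms: Sequence[Matrix]) -> List[List[float]]:
    """Per-point circular standard deviation (degrees) across views.

    Assumption: circular statistics are used because orientations wrap
    around; the poster does not state the exact estimator.
    """
    per_view = [gradient_orientation(s) for s in ssms]
    n = len(per_view[0])
    result = []
    for i in range(n):
        row = []
        for j in range(n):
            mean_vec = sum(cmath.exp(1j * a[i][j]) for a in per_view) / len(per_view)
            r = min(max(abs(mean_vec), 1e-12), 1.0)  # mean resultant length
            row.append(math.degrees(math.sqrt(max(0.0, -2.0 * math.log(r)))))
        result.append(row)
    return result


# Toy check: identical "views" give zero spread, differing views do not.
same = [[float(abs(i - j)) for j in range(5)] for i in range(5)]
vertical = [[float(i) for _ in range(5)] for i in range(5)]
horizontal = [[float(j) for j in range(5)] for _ in range(5)]
std_same = orientation_std_over_views([same, same, same])
std_diff = orientation_std_over_views([vertical, horizontal])
```

Averaging the returned matrix over all points would give a single number comparable in kind to the "average std" values reported in the figure, though not necessarily computed the same way.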

SSM-based Descriptor

• Local patch-based descriptor centered at each point on the diagonal
• Compute an 8-bin histogram of SSM gradients for each of the 11 blocks ($h_i^m$)
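A simplified sketch of such a descriptor follows. It is not the poster's exact layout: the text here does not give the geometry of the 11 blocks, so the sketch computes a single 8-bin histogram over one square patch around each diagonal point. The magnitude weighting, the normalization, and the names `orientation_histogram` and `diagonal_descriptors` are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def orientation_histogram(ssm: Matrix, center: int, radius: int,
                          bins: int = 8) -> List[float]:
    """Histogram of SSM gradient orientations in a square patch
    centered at the diagonal point (center, center).

    Each point votes with its gradient magnitude (an assumption);
    the histogram is normalized to sum to 1 when non-empty.
    """
    n = len(ssm)
    lo = max(center - radius, 1)          # stay inside the matrix so that
    hi = min(center + radius, n - 2)      # central differences are defined
    hist = [0.0] * bins
    for i in range(lo, hi + 1):
        for j in range(lo, hi + 1):
            gy = (ssm[i + 1][j] - ssm[i - 1][j]) / 2.0
            gx = (ssm[i][j + 1] - ssm[i][j - 1]) / 2.0
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % (2.0 * math.pi)
            b = min(int(angle / (2.0 * math.pi) * bins), bins - 1)
            hist[b] += mag
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


def diagonal_descriptors(ssm: Matrix, radius: int) -> List[List[float]]:
    """One local descriptor per time step along the SSM diagonal."""
    return [orientation_histogram(ssm, t, radius) for t in range(len(ssm))]


toy_ssm = [[float(abs(i - j)) for j in range(8)] for i in range(8)]
descs = diagonal_descriptors(toy_ssm, radius=2)
```

A version closer to the poster would compute one such histogram per block and concatenate the 11 results into a single vector per diagonal point.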

Action recognition: we represent each video as a bag of local SSM descriptors $H = (h_1, \ldots, h_n)$ and apply either a Nearest Neighbor Classifier (BoF-NNC) or a Support Vector Machine (BoF-SVM).
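The bag-of-features step with a nearest neighbor classifier might look like the sketch below. The codebook (for example, cluster centers obtained beforehand), the Euclidean comparison of histograms, and all function names are assumptions; the poster does not specify them.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def bof_histogram(descriptors: Sequence[Vector],
                  codebook: Sequence[Vector]) -> List[float]:
    """Normalized histogram of nearest-codeword assignments for one video."""
    counts = [0.0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda k: sq_dist(d, codebook[k]))
        counts[nearest] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total > 0 else counts


def nnc_predict(query: Vector, train_hists: Sequence[Vector],
                train_labels: Sequence[str]) -> str:
    """Label of the training histogram closest to the query histogram."""
    best = min(range(len(train_hists)), key=lambda k: sq_dist(query, train_hists[k]))
    return train_labels[best]


# Toy data: 2D "descriptors" and a two-word codebook (both hypothetical).
codebook = [[0.0, 0.0], [1.0, 1.0]]
train = [
    bof_histogram([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0]], codebook),  # "bend"
    bof_histogram([[1.0, 1.0], [0.9, 0.9], [0.0, 0.0]], codebook),  # "kick"
]
labels = ["bend", "kick"]
query = bof_histogram([[0.0, 0.0], [0.0, 0.0], [0.1, 0.1], [1.0, 1.0]], codebook)
predicted = nnc_predict(query, train, labels)
```

The same histograms could instead be passed to an SVM for the BoF-SVM variant.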

Temporal Multi-View Video Alignment

• Represent the two SSMs by sequences of local SSM descriptors $H_1$ and $H_2$.
• Align $H_1$ and $H_2$ with Dynamic Time Warping.
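A minimal sketch of the alignment step, using the textbook Dynamic Time Warping recurrence. The local cost (Euclidean distance between descriptors), the allowed step pattern, and the function names are assumptions, not details taken from the poster.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def dtw(h1: Sequence[Vector], h2: Sequence[Vector],
        dist: Callable[[Vector, Vector], float] = euclidean
        ) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the total alignment cost and the warping path as (i, j) pairs."""
    n, m = len(h1), len(h2)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(h1[i - 1], h2[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in h1 only
                                 cost[i][j - 1],      # advance in h2 only
                                 cost[i - 1][j - 1])  # advance in both
    # Backtrack from the end to recover which frames were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy sequences: the second one repeats its first frame.
seq_a = [[0.0], [1.0], [2.0]]
seq_b = [[0.0], [0.0], [1.0], [2.0]]
total_cost, warp_path = dtw(seq_a, seq_b)
```

The warping path gives, for each frame of one view, the frame of the other view it is aligned with.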


Self-Similarity Matrices (SSM)

Temporal self-similarity matrix is defined as

$M = (m_{i,j})_{T \times T}$ with elements $m_{i,j} = k(x_i, x_j)$.

$m_{i,j}$ denotes the similarity of observations $x_i$, $x_j$ at times $i$, $j$ according to some kernel function $k$.
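The definition translates directly into code. The sketch below builds the matrix for any kernel function; the name `self_similarity_matrix` and the toy 1D signal are illustrative choices only.

```python
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")


def self_similarity_matrix(observations: Sequence[X],
                           kernel: Callable[[X, X], float]) -> List[List[float]]:
    """Return the T x T matrix M with M[i][j] = kernel(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in observations] for xi in observations]


# Toy 1D signal that rises and returns to its starting value.
signal = [0.0, 1.0, 2.0, 1.0, 0.0]
M = self_similarity_matrix(signal, lambda a, b: abs(a - b))
```

Here the first and last observations are identical, so `M[0][4]` is zero even though the two time steps are far apart; this kind of off-diagonal structure is what the matrix captures.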

Trajectory-based SSM
• $k(x_i, x_j) = \sum_p \| x_i^p - x_j^p \|_2$.
• $x_i^p$: 2D point position on track $p$ at time $i$.
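As a sketch of the trajectory-based kernel, the code below sums, over all tracks, the Euclidean distance between the 2D positions at two time steps. It reads the trailing "2" in the formula as the subscript of the Euclidean norm, which is an interpretation; the data layout (one list of (x, y) points per frame) and the function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Frame = Sequence[Point]  # one 2D position per track p


def trajectory_kernel(frame_i: Frame, frame_j: Frame) -> float:
    """Sum over tracks p of the distance between x_i^p and x_j^p."""
    return sum(math.hypot(xi - xj, yi - yj)
               for (xi, yi), (xj, yj) in zip(frame_i, frame_j))


def trajectory_ssm(frames: Sequence[Frame]) -> List[List[float]]:
    """T x T matrix of pairwise kernel values between frames."""
    return [[trajectory_kernel(a, b) for b in frames] for a in frames]


# Two tracks over three frames; the first track moves away and comes back.
frames = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(3.0, 4.0), (1.0, 0.0)],
    [(0.0, 0.0), (1.0, 0.0)],
]
ssm = trajectory_ssm(frames)
```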

[Figure: golf swing in side view and top view, with trajectory points A, B, C and the corresponding SSMs; axes: time × time]

[Figure: trajectory-based SSMs for the “bend” and “kick” actions, side view and top view, Actor 1 and Actor 2; axes: Time × Time]

HoG-based SSM
• $k(x_i, x_j) = \| x_i - x_j \|_2$.
• $x_i$: person HoG descriptor [Dalal&Triggs].

[Figure: HoG-based SSMs for the “bend” action; axes: Time × Time]


Optical Flow based SSM
• $k(x_i, x_j) = \| x_i - x_j \|_2$.
• $x_i$: OF components ($v_x$ and $v_y$ together, $v_x$ only, or $v_y$ only) computed in the person bounding box and concatenated into a vector.
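The sketch below shows one way to turn per-frame optical flow into the vectors compared by this kernel. The variant names `of`, `ofx` and `ofy` follow the "SSM-of-ofx-ofy-hog" label used in the results; mapping them to "both components", "horizontal only" and "vertical only" is an assumption, as are the data layout and the function names. Every frame is assumed to hold the same number of flow samples, for instance after resizing the bounding box to a fixed grid.

```python
import math
from typing import List, Sequence, Tuple

Flow = Sequence[Tuple[float, float]]  # (vx, vy) per pixel in the bounding box


def flow_vector(flow: Flow, variant: str = "of") -> List[float]:
    """Concatenate the selected flow components into one vector."""
    if variant == "ofx":
        return [vx for vx, _ in flow]
    if variant == "ofy":
        return [vy for _, vy in flow]
    if variant == "of":
        return [vx for vx, _ in flow] + [vy for _, vy in flow]
    raise ValueError(f"unknown variant: {variant}")


def flow_ssm(flows: Sequence[Flow], variant: str = "of") -> List[List[float]]:
    """SSM of Euclidean distances between per-frame flow vectors."""
    vectors = [flow_vector(f, variant) for f in flows]
    return [[math.dist(a, b) for b in vectors] for a in vectors]


# Three frames with two flow samples each (toy values).
flows = [
    [(1.0, 0.0), (0.0, 0.0)],
    [(0.0, 0.0), (0.0, 2.0)],
    [(1.0, 0.0), (0.0, 0.0)],
]
ssm_of = flow_ssm(flows, "of")
ssm_ofx = flow_ssm(flows, "ofx")
ssm_ofy = flow_ssm(flows, "ofy")
```

The same `math.dist` comparison applies to HoG descriptors, since the HoG-based SSM uses the same kernel on a different per-frame vector.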

[Figure: optical-flow-based SSMs for the “bend” action; axes: Time × Time]


Objective
• Human action recognition under view changes.

Related work
• Volumetric 3D reconstruction [Weinland et al. CVIU’06]
• Body part trajectories, projective geometry [Yilmaz and Shah ICCV’05], [Parameswaran and Chellappa IJCV’06]
• View-stable 2D trajectory features [Rao et al. IJCV’02].
• Projective geometry, no point correspondence [Wolf and Zomet IJCV’06].

Problems
• 2D/3D posture recovery is a difficult and generally unsolved problem.
• Direct extension of multiple view geometry methods to human actions is difficult due to the hard cross-view correspondence problem.
• A pure learning approach is difficult due to the limited number of action samples in different views.

Hypothesis
• View invariance for non-rigid motion might be an easier problem compared to static scenes due to the additional time dimension.

This paper
• Cross-view action recognition under weak assumptions:
  – Only one test view
  – Different training view(s)
  – No 2D/3D reconstruction
  – No multi-view point correspondence
  – Assuming bounding box person localization

CROSS-VIEW ACTION RECOGNITION FROM TEMPORAL SELF-SIMILARITIES

Imran Junejo, Emilie Dexter, Ivan Laptev, and Patrick Perez

IRISA / INRIA Rennes, Campus universitaire de Beaulieu, 35042 Rennes Cedex, France