
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

Chao-Yeh Chen and Kristen Grauman, The University of Texas at Austin

[Figure: example snapshots of the action categories: reading, walking, riding horse, phoning, running, playing instrument, taking photo, using computer, riding bike.]

Problem

Goal: learn actions from static images.

Problems of static snapshots:
- There may be only a few training examples for some actions.
- They are often limited to "canonical" instances of the action.

Our idea

Let the system:
- Watch unlabeled videos to learn how human poses change over time.
- Infer nearby poses to expand the sparse training snapshots.

Related Work

- Learning actions with discriminative pose and appearance features, e.g. [Maji et al. 2011, Yang et al. 2010, Yao et al. 2010, Delaitre et al. 2011].
- Expanding training data by mirroring images and videos, e.g. [Papageorgiou et al. 2000, Wang et al. 2009].
- Synthesizing images for action recognition and pose estimation, e.g. [Matikainen et al. 2011, Shakhnarovich et al. 2003, Grauman et al. 2003, Shotton et al. 2011].
- Ours: expands the training set for "free" via pose dynamics learned from unlabeled data.

Approach

Overview: given a few static training snapshots and a pool of unlabeled video, (1) represent each person's body pose, (2) synthesize "before" and "after" pose examples by matching each snapshot to the pose dynamics observed in the video, and (3) train the action classifier with both the real and the synthetic poses.

Representing body pose

Poselet feature: the poselet activation vector (PAV) of [Maji et al. 2011].
- Poselets detected on the person (P1, P2, P3, ...) are pooled into a single activation vector.
- Each poselet captures part of the pose from a given viewpoint.
- Robust to occlusion and cluttered backgrounds.
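Concretely, the PAV is a fixed-length vector of pooled poselet detection scores. Below is a minimal sketch in Python; `detect_poselets` and the number of poselet types are illustrative assumptions standing in for a real poselet detector and model.

```python
import numpy as np

NUM_POSELET_TYPES = 150  # assumption: depends on the trained poselet model

def poselet_activation_vector(detections):
    """Pool poselet detections for one person into a fixed-length
    pose descriptor: entry i holds the strongest activation of
    poselet type i (zero if that poselet never fires).

    detections: iterable of (poselet_id, score) pairs, e.g. the
    output of a hypothetical detect_poselets(image, person_box).
    """
    pav = np.zeros(NUM_POSELET_TYPES)
    for poselet_id, score in detections:
        pav[poselet_id] = max(pav[poselet_id], score)
    return pav
```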

Synthesize pose examples

Given a few canonical training snapshots plus an unlabeled video pool of generic human action, infer for each training snapshot the pose shortly before and the pose shortly after it.

Assumptions:
- The videos cover the space of human pose dynamics.
- No action labels are given for the videos.
- People in the videos are detectable and trackable.

For each training snapshot, find its matched frame in the unlabeled video pool, then generate nearby synthetic poses from the surrounding frames. We use two strategies, each sketched below.

1) Example-based strategy: match the training image to a frame at time t in the unlabeled video pool, then take the poses observed at frames t-T and t+T as synthetic "before" and "after" examples. Candidate neighbors are scored with a similarity function that combines temporal nearness to the matched frame with pose similarity.
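A minimal sketch of the example-based expansion, assuming PAVs have already been computed for the snapshot and for every frame of a tracked person in the video. All names are illustrative, and the Gaussian form of the similarity terms is an assumption, not the poster's exact function.

```python
import numpy as np

def match_frame(snapshot_pav, video_pavs):
    """Index of the video frame whose pose (PAV) best matches the snapshot."""
    dists = np.linalg.norm(video_pavs - snapshot_pav, axis=1)
    return int(np.argmin(dists))

def neighbor_similarity(candidate_pav, matched_pav, dt, sigma_t=5.0, sigma_p=1.0):
    """Illustrative similarity: favor frames that are temporally near
    the matched frame (small |dt|) AND similar to it in pose."""
    temporal = np.exp(-dt**2 / (2 * sigma_t**2))
    pose = np.exp(-np.sum((candidate_pav - matched_pav)**2) / (2 * sigma_p**2))
    return temporal * pose

def expand_example_based(snapshot_pav, video_pavs, T=10):
    """Return synthetic 'before' and 'after' PAVs: the poses observed
    T frames before and after the best-matching video frame."""
    t = match_frame(snapshot_pav, video_pavs)
    before = video_pavs[max(t - T, 0)]
    after = video_pavs[min(t + T, len(video_pavs) - 1)]
    return before, after
```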

2) Manifold-based strategy: embed the video pose features with Locally Linear Embedding [Roweis and Saul 2000], then synthesize new pose features near the matched training pose on the learned low-dimensional manifold.
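A sketch of the manifold-based variant using scikit-learn's LLE. Synthesizing a pose as a local average of manifold neighbors is a simplification for illustration, and all hyperparameter values are placeholders.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import NearestNeighbors

def expand_manifold_based(snapshot_pav, video_pavs, n_neighbors=10,
                          n_components=5, k=5):
    """Embed the video poses with LLE, locate the snapshot on the
    learned manifold, and synthesize a new pose feature as a local
    average of its k manifold neighbors' original PAVs."""
    lle = LocallyLinearEmbedding(n_neighbors=n_neighbors,
                                 n_components=n_components)
    embedded = lle.fit_transform(video_pavs)           # manifold coords
    snap_emb = lle.transform(snapshot_pav.reshape(1, -1))
    nn = NearestNeighbors(n_neighbors=k).fit(embedded)
    _, idx = nn.kneighbors(snap_emb)
    return video_pavs[idx[0]].mean(axis=0)             # synthetic PAV
```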

Train with real & synthetic poses

The synthetic "before" and "after" poses are added to the training pool alongside the real snapshot poses, each keeping the action label of the snapshot it was generated from. Because the synthetic examples come from video frames (the source domain) while the classifier must handle static snapshots (the target domain), we connect the two domains with the domain adaptation method of [Daume III 2007], which maps both into a common augmented pose feature space.

[Figure: positive (+) and negative (-) real and synthetic pose examples plotted in the pose feature space, before and after domain adaptation.]
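The [Daume III 2007] adaptation is simple feature augmentation: every example gets a shared copy of its features plus a domain-specific copy. A minimal sketch, with a linear SVM as an assumed classifier and random arrays standing in for real data:

```python
import numpy as np
from sklearn.svm import LinearSVC

def augment(X, domain):
    """Daume III (2007) feature augmentation.
    source (video frames)     -> [x, x, 0]
    target (static snapshots) -> [x, 0, x]"""
    Z = np.zeros_like(X)
    return np.hstack([X, X, Z]) if domain == "source" else np.hstack([X, Z, X])

# Placeholder data: real snapshot PAVs (target domain) and synthetic
# PAVs generated from video (source domain); each synthetic pose keeps
# the label of the snapshot it was generated from.
rng = np.random.default_rng(0)
X_real, y_real = rng.random((20, 150)), rng.integers(0, 2, 20)
X_syn, y_syn = rng.random((40, 150)), rng.integers(0, 2, 40)

X_train = np.vstack([augment(X_real, "target"), augment(X_syn, "source")])
y_train = np.concatenate([y_real, y_syn])
clf = LinearSVC().fit(X_train, y_train)

# At test time, snapshots live in the target domain:
X_test = rng.random((5, 150))
print(clf.predict(augment(X_test, "target")))
```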

Results

Datasets:
- PASCAL VOC 2010 actions (11 verbs): labeled static images.
- Stanford 40 Actions (10 selected verbs): labeled static images.
- Hollywood Human Actions: the unlabeled video pool.

Recognizing activity in images

[Figure: accuracy (mAP) vs. number of training images, from 15 to 50 on PASCAL and from 17 to 50 on Stanford 10.]
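For reference, the plots report mean average precision (mAP); a minimal sketch of computing it with scikit-learn, assumed here for illustration (the authors' evaluation code is not shown):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true: (n, n_classes) binary relevance; y_score: (n, n_classes)
    classifier scores. mAP = mean over classes of average precision."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps))
```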

[Figure: our accuracy gain vs. pose diversity among the training images.]
- Most benefit when the labeled training images lack pose diversity.

[Figure: synthetic feature examples: for each real training snapshot, the synthetic "before" and "after" poses generated from it.]

Recognizing activity in videos

- Substantial gains when recognizing actions in video, since our method infers intermediate poses not covered by the original snapshots.

[Figure: accuracy of video activity recognition, training on snapshots and testing on 78 videos from the HMDB51, ASLAN, and UCF datasets.]

Conclusions

- Augment training data without additional labeling cost by leveraging unlabeled video.
- Simple but effective exemplar- and manifold-based extrapolation strategies.
- Significant advantage when labeled training examples are sparse.
- Domain adaptation connects the real and generated examples.