
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

Chao-Yeh Chen and Kristen Grauman, The University of Texas at Austin

[Figure: example snapshots of the action categories: reading, walking, riding horse, phoning, running, playing instrument, taking photo, using computer, riding bike.]

Problem

Goal: learn actions from static images.

Problems of static snapshots:
- There may be only a few training examples for some actions.
- They are often limited to "canonical" instances of the action.

Our idea

Let the system:
- Watch unlabeled videos to learn how human poses change over time.
- Infer nearby poses to expand the sparse training snapshots.

Related Work

- Learning actions with discriminative pose and appearance features, e.g. [Maji et al. 2011, Yang et al. 2010, Yao et al. 2010, Delaitre et al. 2011].
- Expanding training data by mirroring images and videos, e.g. [Papageorgiou et al. 2000, Wang et al. 2009].
- Synthesizing images for action recognition and pose estimation, e.g. [Matikainen et al. 2011, Shakhnarovich et al. 2003, Grauman et al. 2003, Shotton et al. 2011].
- Ours: expands the training set for "free" via pose dynamics learned from unlabeled data.

Approach

Overview: given a few static training snapshots and a pool of unlabeled video, (1) represent each person's body pose, (2) synthesize "before" and "after" pose examples by matching each snapshot to the pose dynamics observed in the video, and (3) train the action classifier with both the real and the synthetic poses.

Representing body pose

Poselet feature: the poselet activation vector (PAV) of [Maji et al. 2011].
- Poselets detected on the person (P1, P2, P3, ...) are pooled into a single activation vector.
- Each poselet captures part of the pose from a given viewpoint.
- Robust to occlusion and cluttered backgrounds.
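Concretely, the PAV is a fixed-length vector of pooled poselet detection scores. Below is a minimal sketch in Python; `detect_poselets` and the number of poselet types are illustrative assumptions standing in for a real poselet detector and model.

```python
import numpy as np

NUM_POSELET_TYPES = 150  # assumption: depends on the trained poselet model

def poselet_activation_vector(detections):
    """Pool poselet detections for one person into a fixed-length
    pose descriptor: entry i holds the strongest activation of
    poselet type i (zero if that poselet never fires).

    detections: iterable of (poselet_id, score) pairs, e.g. the
    output of a hypothetical detect_poselets(image, person_box).
    """
    pav = np.zeros(NUM_POSELET_TYPES)
    for poselet_id, score in detections:
        pav[poselet_id] = max(pav[poselet_id], score)
    return pav
```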

Synthesize pose examples

Given a few canonical training snapshots plus an unlabeled video pool of generic human action, infer for each training snapshot the pose shortly before and the pose shortly after it.

Assumptions:
- The videos cover the space of human pose dynamics.
- No action labels are given for the videos.
- People in the videos are detectable and trackable.

For each training snapshot, find its matched frame in the unlabeled video pool, then generate nearby synthetic poses from the surrounding frames. We use two strategies, each sketched below.

1) Example-based strategy: match the training image to a frame at time t in the unlabeled video pool, then take the poses observed at frames t-T and t+T as synthetic "before" and "after" examples. Candidate neighbors are scored with a similarity function that combines temporal nearness to the matched frame with pose similarity.
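A minimal sketch of the example-based expansion, assuming PAVs have already been computed for the snapshot and for every frame of a tracked person in the video. All names are illustrative, and the Gaussian form of the similarity terms is an assumption, not the poster's exact function.

```python
import numpy as np

def match_frame(snapshot_pav, video_pavs):
    """Index of the video frame whose pose (PAV) best matches the snapshot."""
    dists = np.linalg.norm(video_pavs - snapshot_pav, axis=1)
    return int(np.argmin(dists))

def neighbor_similarity(candidate_pav, matched_pav, dt, sigma_t=5.0, sigma_p=1.0):
    """Illustrative similarity: favor frames that are temporally near
    the matched frame (small |dt|) AND similar to it in pose."""
    temporal = np.exp(-dt**2 / (2 * sigma_t**2))
    pose = np.exp(-np.sum((candidate_pav - matched_pav)**2) / (2 * sigma_p**2))
    return temporal * pose

def expand_example_based(snapshot_pav, video_pavs, T=10):
    """Return synthetic 'before' and 'after' PAVs: the poses observed
    T frames before and after the best-matching video frame."""
    t = match_frame(snapshot_pav, video_pavs)
    before = video_pavs[max(t - T, 0)]
    after = video_pavs[min(t + T, len(video_pavs) - 1)]
    return before, after
```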

2) Manifold-based strategy: embed the video pose features with Locally Linear Embedding [Roweis and Saul 2000], then synthesize new pose features near the matched training pose on the learned low-dimensional manifold.
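A sketch of the manifold-based variant using scikit-learn's LLE. Synthesizing a pose as a local average of manifold neighbors is a simplification for illustration, and all hyperparameter values are placeholders.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import NearestNeighbors

def expand_manifold_based(snapshot_pav, video_pavs, n_neighbors=10,
                          n_components=5, k=5):
    """Embed the video poses with LLE, locate the snapshot on the
    learned manifold, and synthesize a new pose feature as a local
    average of its k manifold neighbors' original PAVs."""
    lle = LocallyLinearEmbedding(n_neighbors=n_neighbors,
                                 n_components=n_components)
    embedded = lle.fit_transform(video_pavs)           # manifold coords
    snap_emb = lle.transform(snapshot_pav.reshape(1, -1))
    nn = NearestNeighbors(n_neighbors=k).fit(embedded)
    _, idx = nn.kneighbors(snap_emb)
    return video_pavs[idx[0]].mean(axis=0)             # synthetic PAV
```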

Train with real & synthetic poses

The synthetic "before" and "after" poses are added to the training pool alongside the real snapshot poses, each keeping the action label of the snapshot it was generated from. Because the synthetic examples come from video frames (the source domain) while the classifier must handle static snapshots (the target domain), we connect the two domains with the domain adaptation method of [Daume III 2007], which maps both into a common augmented pose feature space.

[Figure: positive (+) and negative (-) real and synthetic pose examples plotted in the pose feature space, before and after domain adaptation.]
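The [Daume III 2007] adaptation is simple feature augmentation: every example gets a shared copy of its features plus a domain-specific copy. A minimal sketch, with a linear SVM as an assumed classifier and random arrays standing in for real data:

```python
import numpy as np
from sklearn.svm import LinearSVC

def augment(X, domain):
    """Daume III (2007) feature augmentation.
    source (video frames)     -> [x, x, 0]
    target (static snapshots) -> [x, 0, x]"""
    Z = np.zeros_like(X)
    return np.hstack([X, X, Z]) if domain == "source" else np.hstack([X, Z, X])

# Placeholder data: real snapshot PAVs (target domain) and synthetic
# PAVs generated from video (source domain); each synthetic pose keeps
# the label of the snapshot it was generated from.
rng = np.random.default_rng(0)
X_real, y_real = rng.random((20, 150)), rng.integers(0, 2, 20)
X_syn, y_syn = rng.random((40, 150)), rng.integers(0, 2, 40)

X_train = np.vstack([augment(X_real, "target"), augment(X_syn, "source")])
y_train = np.concatenate([y_real, y_syn])
clf = LinearSVC().fit(X_train, y_train)

# At test time, snapshots live in the target domain:
X_test = rng.random((5, 150))
print(clf.predict(augment(X_test, "target")))
```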

Results

Datasets:
- PASCAL VOC 2010 actions (11 verbs): labeled static images.
- Stanford 40 Actions (10 selected verbs): labeled static images.
- Hollywood Human Actions: the unlabeled video pool.

Recognizing activity in images

[Figure: accuracy (mAP) vs. number of training images, from 15 to 50 on PASCAL and from 17 to 50 on Stanford 10.]
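For reference, the plots report mean average precision (mAP); a minimal sketch of computing it with scikit-learn, assumed here for illustration (the authors' evaluation code is not shown):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true: (n, n_classes) binary relevance; y_score: (n, n_classes)
    classifier scores. mAP = mean over classes of average precision."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps))
```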

[Figure: our accuracy gain vs. pose diversity among the training images.]
- Most benefit when the labeled training images lack pose diversity.

[Figure: synthetic feature examples: for each real training snapshot, the synthetic "before" and "after" poses generated from it.]

Recognizing activity in videos

- Substantial gains when recognizing actions in video, since our method infers intermediate poses not covered by the original snapshots.

[Figure: accuracy of video activity recognition, training on snapshots and testing on 78 videos from the HMDB51, ASLAN, and UCF datasets.]

Conclusions

- Augment training data without additional labeling cost by leveraging unlabeled video.
- Simple but effective exemplar- and manifold-based extrapolation strategies.
- Significant advantage when labeled training examples are sparse.
- Domain adaptation connects the real and generated examples.