[Figure: example action snapshots — reading, walking, riding horse, phoning, running, playing instrument, taking photo, using computer, riding bike]
Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots
Chao-Yeh Chen and Kristen Grauman, The University of Texas at Austin
Problem
Goal: learn actions from static images.
Related Work
- Learn actions with discriminative pose and appearance features, e.g. [Maji et al. 2011, Yang et al. 2010, Yao et al. 2010, Delaitre et al. 2011].
- Expand training data by mirroring images and videos, e.g. [Papageorgiou et al. 2000, Wang et al. 2009].
- Synthesize images for action recognition and pose estimation, e.g. [Matikainen et al. 2011, Shakhnarovich et al. 2003, Grauman et al. 2003, Shotton et al. 2011].
- Ours: expand the training set for "free" via pose dynamics learned from unlabeled video.
Problems of static snapshots:
- May have only a few training examples for some actions.
- Often limited to "canonical" instances of the action.

Approach: expand snapshots by pose dynamics learned from videos.
Datasets:
- Unlabeled video: Hollywood Human Actions.
- Images: PASCAL VOC 2010 (11 verbs), Stanford 40 Actions (10 selected verbs).
Conclusions
- Augment training data without additional labeling cost by leveraging unlabeled video.
- Simple but effective exemplar/manifold extrapolation strategies.
- Significant advantage when labeled training examples are sparse.
- Domain adaptation connects real and generated examples.
Synthesize pose examples
Given a few canonical training snapshots and an unlabeled video pool of generic human action, infer for each training snapshot the pose before and after it. Detected poselets on each person are summarized as a poselet activation vector (PAV) with entries P1, P2, P3, …

Assumptions:
- Videos cover the space of human pose dynamics.
- No action labels are given.
- People are detectable and trackable.

Two strategies to find a matched frame for a training snapshot and generate nearby synthetic poses:
1) Example-based strategy
2) Manifold-based strategy
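The example-based strategy amounts to a nearest-frame lookup followed by reading off temporal neighbors. The sketch below is an illustrative reconstruction, not the authors' code: the squared-Euclidean matching and the offset T are assumptions.

```python
# Hypothetical sketch of the example-based strategy: match a training
# snapshot to its closest frame (in pose-feature space) within one
# tracked person's video, then take the frames T steps before and
# after as synthetic "before"/"after" poses.

def synthesize_before_after(snapshot_pav, video_pavs, T=5):
    """video_pavs: per-frame pose features for one tracked person."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Index of the matched frame: the frame whose pose is closest.
    t = min(range(len(video_pavs)),
            key=lambda i: dist2(snapshot_pav, video_pavs[i]))
    before = video_pavs[max(t - T, 0)]                   # clamp at track start
    after = video_pavs[min(t + T, len(video_pavs) - 1)]  # clamp at track end
    return before, after

# Toy usage with 1-D "poses": the snapshot matches frame 4.
track = [[float(i)] for i in range(10)]
print(synthesize_before_after([4.2], track, T=2))  # ([2.0], [6.0])
```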
Representing body pose: the poselet feature
- Poselet activation vector (PAV) [Maji et al. 2011].
- Each poselet captures part of the pose from a given viewpoint.
- Robust to occlusion and cluttered background.
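A PAV can be sketched roughly as below; the (poselet_id, score) detection format and the poselet count are hypothetical placeholders, not the representation's actual implementation.

```python
# Sketch of building a poselet activation vector (PAV) in the spirit of
# Maji et al. 2011: one entry per poselet type, holding the strongest
# activation of that poselet for the person (0 if it did not fire).

def build_pav(detections, num_poselets):
    pav = [0.0] * num_poselets
    for poselet_id, score in detections:
        pav[poselet_id] = max(pav[poselet_id], score)
    return pav

# Toy usage: 5 poselet types, three detections on one person.
print(build_pav([(0, 0.9), (3, 0.4), (0, 0.7)], num_poselets=5))
# [0.9, 0.0, 0.0, 0.4, 0.0]
```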
1) Example-based strategy: match the training image to a frame at time t of a tracked person in the unlabeled video pool; the frames at t−T and t+T supply the "before" and "after" poses.
2) Manifold-based strategy: embed poses with Locally Linear Embedding [Roweis and Saul 2000], choosing neighbors with a similarity function that accounts for temporal nearness and pose similarity.
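The similarity function itself is not reproduced here; one plausible sketch, assuming a Gaussian affinity on pose distance gated by a hard temporal window (both the form and the parameters T and sigma are assumptions, not the paper's exact formula):

```python
import math

# Hypothetical similarity for picking manifold neighbors: high when two
# frames are close in pose space AND within T frames of each other.

def pose_similarity(p_i, p_j, sigma=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(p_i, p_j))
    return math.exp(-d2 / (2 * sigma ** 2))   # Gaussian pose affinity

def similarity(p_i, t_i, p_j, t_j, T=5, sigma=1.0):
    if abs(t_i - t_j) > T:    # too far apart in time: not neighbors
        return 0.0
    return pose_similarity(p_i, p_j, sigma)

print(similarity([0.9, 0.0], 10, [0.8, 0.1], 12))   # ≈ 0.99
print(similarity([0.9, 0.0], 10, [0.8, 0.1], 100))  # 0.0
```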
[Figure: real training examples in the pose feature space — frames from video (source) and static snapshots (target)]
Overview
Few static snapshots + unlabeled video data → synthesize pose examples → train with real & synthetic poses.

Train with real & synthetic poses
The real and synthetic pose examples come from different domains (static snapshots vs. video frames), so we combine them with domain adaptation [Daume III 2007], which maps both pose feature spaces into a common augmented space before training the classifier.
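The [Daume III 2007] feature augmentation is concrete enough to show directly: each source feature x becomes (x, x, 0) and each target feature becomes (x, 0, x), so a linear classifier can learn shared and domain-specific weight blocks. Treating video frames as source and static snapshots as target matches this poster's setup; the code itself is our sketch, not the authors'.

```python
# Daume III (2007) "frustratingly easy" domain adaptation: triple the
# feature space into [shared | source-specific | target-specific] blocks.

def augment(x, domain):
    zeros = [0.0] * len(x)
    if domain == "source":            # frames from unlabeled video
        return x + x + zeros
    return x + zeros + x              # static snapshots (target)

print(augment([0.5, 1.0], "source"))  # [0.5, 1.0, 0.5, 1.0, 0.0, 0.0]
print(augment([0.5, 1.0], "target"))  # [0.5, 1.0, 0.0, 0.0, 0.5, 1.0]
```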
Results
Recognizing activity in images
[Plots: accuracy (mAP) on PASCAL (# train = 15) and Stanford 10 (# train = 17); accuracy (mAP) as the number of training images grows from 15 to 50 (PASCAL) and from 17 to 50 (Stanford 10)]
[Plot: our accuracy gain vs. pose diversity among training images — most benefit when the training images lack pose diversity]
[Figure: real training images with synthesized "before"/"after" poses; accuracy with synthetic feature examples]
Our idea
Let the system:
- Watch videos to learn how human poses change over time.
- Infer nearby poses to expand the sparse training snapshots.

Recognizing activity in videos
- Substantial gains when recognizing actions in video, since our method infers intermediate poses not covered in the original snapshots.
- Accuracy of video activity recognition measured on 78 testing videos from HMDB51+ASLAN+UCF data.