Date posted: 03-Dec-2014
Human Action Recognition by Learning Bases of Action Attributes and Parts
Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei
Stanford University
Action Classification in Still Images
Example action: Riding bike
Low-level feature:
Yao & Fei-Fei, 2010; Koniusz et al., 2010; Delaitre et al., 2010; Yao et al., 2011
High-level representation:
- Semantic concepts: attributes (riding a bike, wearing a helmet, pedaling the pedals, sitting on bike seat)
- Parts: objects
- Parts: human poses
- Contexts of attributes & parts
Farhadi et al., 2009; Lampert et al., 2009; Berg et al., 2010; Parikh & Grauman, 2011; Gupta et al., 2009; Yao & Fei-Fei, 2010; Torresani et al., 2010; Li et al., 2010; Yang et al., 2010; Maji et al., 2011; Liu et al., 2011
Incorporate human knowledge; more understanding of image content; more discriminative classifier.
Outline
• Intuition: Action Attributes and Parts
• Algorithm: Learning Bases of Attributes and Parts
• Experiments: PASCAL VOC & Stanford 40 Actions
• Conclusion
Action Attributes and Parts
Attributes: … …
  semantic descriptions of human actions
  Riding bike vs. not riding bike: discriminative classifier, e.g. SVM
  Lampert et al., 2009; Berg et al., 2010
Parts-Objects: … …
Parts-Poselets: … …
  A pre-trained detector
  Object Bank, Li et al., 2010; Poselet, Bourdev & Malik, 2009
Attribute classification, object detection, poselet detection
a: Image feature vector
Action bases Φ
Bases coefficients w
a ≈ Φw
• Sparse
• Encodes context
• Robust to initially weak detections
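As a rough illustration of the representation a ≈ Φw, the sketch below concatenates made-up attribute, object, and poselet scores into a feature vector a and reconstructs it from three hand-written bases with a sparse coefficient vector w. All numbers, and the helper name `reconstruct`, are hypothetical and are not taken from the slides.

```python
def reconstruct(bases, w):
    """Return the vector Phi * w, where `bases` lists the M basis vectors
    (the columns of Phi) and `w` holds one coefficient per basis."""
    dim = len(bases[0])
    return [sum(w_j * basis[d] for w_j, basis in zip(w, bases)) for d in range(dim)]


# Hypothetical detector outputs (two attributes, two objects, two poselets).
attribute_scores = [0.9, 0.1]
object_scores = [0.8, 0.0]
poselet_scores = [0.7, 0.2]

# a: image feature vector, the concatenation of all responses.
a = attribute_scores + object_scores + poselet_scores

# Three hypothetical action bases; each one groups responses that co-occur.
bases = [
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
]

# Sparse coefficients: the third basis is not used at all.
w = [0.8, 0.1, 0.0]

approx = reconstruct(bases, w)
residual = sum((x - y) ** 2 for x, y in zip(a, approx)) ** 0.5
print("a      =", a)
print("Phi*w  =", approx)
print("error  =", residual)
```

Because each basis spans attributes, objects, and poselets together, a single coefficient in w stands for a whole pattern of co-occurring responses, which is one way to read the "encodes context" bullet.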
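Bases of Attributes & Parts: Training
a ≈ Φw
• Input: a_1, …, a_N
• Output: Φ = [Φ_1, …, Φ_M]; W = [w_1, …, w_N] (sparse)
• Jointly estimate Φ and W:

$$\min_{\Phi, W} \sum_{i=1}^{N} \left( \frac{1}{2} \| a_i - \Phi w_i \|_2^2 + \lambda \| w_i \|_1 \right)$$

$$\text{s.t. } \forall j,\ \| \Phi_j \|_1 + \frac{\gamma}{2} \| \Phi_j \|_2^2 \le 1$$

  First term: accurate approximation
  Second term: L1 regularization, sparsity of W
  Constraint: elastic net, sparsity of Φ [Zou & Hastie, 2005]
• Optimization: stochastic gradient descent.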
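To make the training objective concrete, here is a minimal sketch that only evaluates it: the reconstruction error plus the L1 penalty on each w_i, together with a check of the elastic-net constraint on each basis Φ_j. It does not perform the stochastic gradient descent itself. The function names, the parameter names `lam` (L1 weight) and `gamma` (elastic-net weight), and the toy data are assumptions made for illustration.

```python
def l1_norm(v):
    return sum(abs(x) for x in v)


def squared_l2_norm(v):
    return sum(x * x for x in v)


def reconstruct(bases, w):
    """Phi * w, with `bases` given as the list of columns of Phi."""
    dim = len(bases[0])
    return [sum(w_j * basis[d] for w_j, basis in zip(w, bases)) for d in range(dim)]


def training_objective(features, bases, codes, lam):
    """Sum over images of 0.5 * ||a_i - Phi w_i||_2^2 + lam * ||w_i||_1."""
    total = 0.0
    for a_i, w_i in zip(features, codes):
        approx = reconstruct(bases, w_i)
        residual = [x - y for x, y in zip(a_i, approx)]
        total += 0.5 * squared_l2_norm(residual) + lam * l1_norm(w_i)
    return total


def bases_feasible(bases, gamma):
    """True if every basis satisfies ||Phi_j||_1 + gamma/2 * ||Phi_j||_2^2 <= 1."""
    return all(
        l1_norm(b) + 0.5 * gamma * squared_l2_norm(b) <= 1.0 + 1e-12
        for b in bases
    )


# Toy data: two images, two bases, exact reconstruction.
features = [[0.5, 0.0], [0.0, 0.25]]
bases = [[0.5, 0.0], [0.0, 0.5]]
codes = [[1.0, 0.0], [0.0, 0.5]]

objective = training_objective(features, bases, codes, lam=0.1)
feasible = bases_feasible(bases, gamma=1.0)
print(objective, feasible)
```

In this toy case the reconstruction is exact, so the objective reduces to the L1 penalty alone; a training loop would alternate between updating the codes and the bases while keeping `bases_feasible` true.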
Bases of Attributes & Parts: Testing
a ≈ Φw
• Input: a; Φ = [Φ_1, …, Φ_M]
• Output: w (sparse)
• Estimate w:

$$\min_{w} \frac{1}{2} \| a - \Phi w \|_2^2 + \lambda \| w \|_1$$

  First term: accurate approximation
  Second term: L1 regularization, sparsity of w
• Optimization: stochastic gradient descent.
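The test-time problem is an L1-regularized least-squares fit of a single feature vector a to fixed bases Φ. The slides name stochastic gradient descent as the optimizer; the sketch below swaps in plain proximal gradient descent (ISTA) because it is short and easy to check by hand. The function names, the step size, and the toy bases are assumptions, and `lam` stands for the L1 weight.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def sparse_code(a, bases, lam, step, n_iters=200):
    """Approximately minimize 0.5 * ||a - Phi w||_2^2 + lam * ||w||_1 over w.

    `bases` lists the columns of Phi. `step` should not exceed
    1 / (largest eigenvalue of Phi^T Phi) for the iteration to converge.
    """
    m, dim = len(bases), len(a)
    w = [0.0] * m
    for _ in range(n_iters):
        approx = [sum(w[j] * bases[j][d] for j in range(m)) for d in range(dim)]
        residual = [approx[d] - a[d] for d in range(dim)]  # Phi w - a
        grad = [sum(bases[j][d] * residual[d] for d in range(dim)) for j in range(m)]
        # Gradient step on the quadratic term, then shrinkage for the L1 term.
        w = [soft_threshold(w[j] - step * grad[j], step * lam) for j in range(m)]
    return w


# Toy example with two orthonormal bases in a 3-dimensional feature space.
bases = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [1.0, 0.05, 0.3]
w = sparse_code(a, bases, lam=0.1, step=1.0)
print(w)
```

With orthonormal bases the minimizer is the soft-thresholded projection of a onto each basis, so the weak response of 0.05 is driven exactly to zero while the strong response is kept, which is the sparsity behaviour the slide describes.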
PASCAL VOC 2010 Action Dataset
Figure credit: Ivan Laptev
• 9 classes, 50-100 trainval / testing images per class
• 14 attributes: trained from the trainval images
• 27 objects: taken from Li et al., NIPS 2010
• 150 poselets: taken from Bourdev & Malik, ICCV 2009
VOC 2010: Classification Result
[Bar chart: average precision (y-axis, 0.3 to 0.9) for each of the 9 classes: Phoning, Playing instrument, Reading, Riding bike, Riding horse, Running, Taking photo, Using computer, Walking. Methods compared: Our method, use “a”; Our method, use “w”; Poselet, Maji et al., 2011; SURREY_MK; UCLEAR_DOSP.]
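The results are reported as average precision per class. As a reminder of what that number measures, here is a minimal sketch of non-interpolated average precision over a ranked list. This is a generic definition, not necessarily the exact PASCAL VOC evaluation protocol; the function name and the toy scores are assumptions.

```python
def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each positive.

    `scores` are classifier confidences, `labels` are 1 for images of the
    target action and 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(labels)
    if n_positive == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_positive


# Toy ranking: positives at ranks 1 and 3 give (1/1 + 2/3) / 2.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(ap)
```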
VOC 2010: Analysis of Bases
[Figure: 400 action bases, shown over attributes, objects, and poselets.]
VOC 2010: Control Experiment
[Bar chart: mean average precision (y-axis, 0.45 to 0.7) for the feature combinations A+O+P, A+O, A+P, and O+P, comparing Use “a” with Use “w”. A: attribute; O: object; P: poselet.]
PASCAL VOC 2011 Result
Average precision (%) per class:

Class                Others’ best in comp9   Others’ best in comp10   Our method
Jumping              71.6                    59.5                     66.7
Phoning              50.7                    31.3                     41.1
Playing instrument   77.5                    45.6                     60.8
Reading              37.8                    27.8                     42.2
Riding bike          88.8                    84.4                     90.5
Riding horse         90.2                    88.3                     92.2
Running              87.9                    77.6                     86.2
Taking photo         25.7                    31.0                     28.8
Using computer       58.9                    47.4                     63.5
Walking              59.5                    57.6                     64.2

• Our method ranks first in nine out of ten classes in comp10.
• Our method achieves the best performance in five out of ten classes if we consider both comp9 and comp10.
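The two summary counts can be checked directly from the table. The short sketch below re-enters the numbers from the slide and counts the classes where our method beats the best competing comp10 entry, and where it beats the best entry in either competition. The variable names are chosen for illustration.

```python
# (others' best in comp9, others' best in comp10, our method), copied from the table.
results = {
    "Jumping": (71.6, 59.5, 66.7),
    "Phoning": (50.7, 31.3, 41.1),
    "Playing instrument": (77.5, 45.6, 60.8),
    "Reading": (37.8, 27.8, 42.2),
    "Riding bike": (88.8, 84.4, 90.5),
    "Riding horse": (90.2, 88.3, 92.2),
    "Running": (87.9, 77.6, 86.2),
    "Taking photo": (25.7, 31.0, 28.8),
    "Using computer": (58.9, 47.4, 63.5),
    "Walking": (59.5, 57.6, 64.2),
}

# Classes where our method beats the best other comp10 entry.
wins_comp10 = [name for name, (c9, c10, ours) in results.items() if ours > c10]

# Classes where our method beats the best other entry in both competitions.
wins_overall = [name for name, (c9, c10, ours) in results.items() if ours > max(c9, c10)]

print(len(wins_comp10), "of", len(results), "classes in comp10")
print(len(wins_overall), "of", len(results), "classes overall:", wins_overall)
```

The only class where the comp10 comparison is lost is Taking photo; the five overall wins are Reading, Riding bike, Riding horse, Using computer, and Walking.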
Stanford 40 Actions
http://vision.stanford.edu/Datasets/40actions.html
• 40 action classes, 9532 real-world images from Google, Flickr, etc.
Classes: Applauding, Blowing bubbles, Brushing teeth, Calling, Cleaning floor, Climbing wall, Cooking, Cutting trees, Cutting vegetables, Drinking, Feeding horse, Fishing, Fixing bike, Gardening, Holding umbrella, Jumping, Playing guitar, Playing violin, Pouring liquid, Pushing cart, Reading, Repairing car, Riding bike, Riding horse, Rowing, Running, Shooting arrow, Smoking cigarette, Taking photo, Texting message, Throwing frisbee, Using computer, Using microscope, Using telescope, Walking dog, Washing dishes, Watching television, Waving hands, Writing on board, Writing on paper
Example classes highlighted on the following slides: Riding bike and Fixing bike; Writing on board and Writing on paper; Drinking, Gardening, and Smoking cigarette.
Stanford 40 Actions: Result
• We use 45 attributes, 81 objects, and 150 poselets.
• Compare our method with the Locality-constrained Linear Coding (LLC, Wang et al., CVPR 2010) baseline.
[Bar chart: average precision (y-axis, 0 to 0.9) per class for LLC and Our Method. Classes along the x-axis: Riding a horse, Rowing a boat, Riding a bike, Climbing mountain, Jumping, Cleaning the floor, Walking a dog, Shooting an arrow, Playing guitar, Fishing, Holding up an umbrella, Running, Throwing a frisbee, Writing on a board, Watching TV, Cutting trees, Feeding a horse, Gardening, Writing on a book, Repairing a car, Looking through a microscope, Cutting vegetables, Blowing bubbles, Playing violin, Brushing teeth, Repairing a bike, Pushing a cart, Using a computer, Applauding, Cooking, Smoking cigarette, Looking through a telescope, Washing dishes, Drinking, Calling, Waving hands, Pouring liquid, Reading a book, Taking photos, Texting message.]
Conclusion
Attributes: … …
Parts-Objects: … …
Parts-Poselets: … …
a: Image feature vector
Action bases Φ
Bases coefficients w
a ≈ Φw
Acknowledgement