UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild
Khurram Soomro, Amir Roshan Zamir and Mubarak Shah
CRCV-TR-12-01
November 2012
Keywords: Action Dataset, UCF101, UCF50, Action Recognition
Center for Research in Computer Vision
University of Central Florida
4000 Central Florida Blvd.
Orlando, FL 32816-2365 USA
arXiv:1212.0402v1 [cs.CV] 3 Dec 2012
{ksoomro, aroshan, shah}@cs.ucf.edu
http://crcv.ucf.edu/data/UCF101.php
Abstract
We introduce UCF101, which is currently the largest dataset of human actions. It consists of 101 action classes, over 13k clips and 27 hours of video data. The database consists of realistic user-uploaded videos containing camera motion and cluttered background. Additionally, we provide baseline action recognition results on this new dataset using a standard bag of words approach with an overall performance of 44.5%. To the best of our knowledge, UCF101 is currently the most challenging dataset of actions due to its large number of classes, large number of clips and the unconstrained nature of such clips.
1. Introduction
The majority of existing action recognition datasets suffer from two disadvantages: 1) The number of their classes is typically very low compared to the richness of actions performed by humans in reality, e.g. the KTH [11], Weizmann [3], UCF Sports [10] and IXMAS [12] datasets include only 6, 9, 9 and 11 classes, respectively. 2) The videos are recorded in unrealistically controlled environments. For instance, KTH, Weizmann and IXMAS are staged by actors; HOHA [7] and UCF Sports are composed of movie clips captured by professional filming crews. Recently, web videos have been used in order to utilize unconstrained user-uploaded data and alleviate the second issue [6, 8, 9, 5]. However, the first disadvantage remains unresolved, as the largest existing dataset does not include more than 51 actions, while several works have shown that the number of classes plays a crucial role in evaluating an action recognition method [4, 9]. Therefore, we have compiled a new dataset with 101 actions and 13320 clips, which is nearly twice as large as the largest existing dataset in terms of number of actions and clips. (HMDB51 [5] and UCF50 [9] are currently the largest ones, with 6766 clips of 51 actions and 6681 clips of 50 actions, respectively.)
The dataset is composed of web videos which are recorded in unconstrained environments and typically include camera motion, various lighting conditions, partial occlusion, low quality frames, etc. Fig. 1 shows sample frames of 6 action classes from UCF101.
Figure 1. Sample frames for 6 action classes of UCF101.
2. Dataset Details
Action Classes: UCF101 includes a total of 101 action classes, which we have divided into five types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, Sports.
UCF101 is an extension of UCF50, which included the following 50 action classes: {Baseball Pitch, Basketball Shooting, Bench Press, Biking, Billiards Shot, Breaststroke, Clean and Jerk, Diving, Drumming, Fencing, Golf Swing, High Jump, Horse Race, Horse Riding, Hula Hoop, Javelin Throw, Juggling Balls, Jumping Jack, Jump Rope, Kayaking, Lunges, Military Parade, Mixing Batter, Nun chucks, Pizza Tossing, Playing Guitar, Playing Piano, Playing Tabla, Playing Violin, Pole Vault, Pommel Horse, Pull Ups, Punch, Push Ups, Rock Climbing Indoor, Rope Climbing, Rowing, Salsa Spins, Skate Boarding, Skiing, Skijet, Soccer Juggling, Swing, Tai Chi, Tennis Swing, Throw Discus, Trampoline Jumping, Volleyball Spiking, Walking with a dog, Yo Yo}. The color of the class labels specifies which predefined action type they belong to.
Figure 2. 101 actions included in UCF101 shown with one sample frame. The color of frame borders specifies to which action type they belong: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, Sports.
The following 51 new classes are introduced in UCF101: {Apply Eye Makeup, Apply Lipstick, Archery, Baby Crawling, Balance Beam, Band Marching, Basketball Dunk, Blow Drying Hair, Blowing Candles, Body Weight Squats, Bowling, Boxing-Punching Bag, Boxing-Speed Bag, Brushing Teeth, Cliff Diving, Cricket Bowling, Cricket Shot, Cutting In Kitchen, Field Hockey Penalty, Floor Gymnastics, Frisbee Catch, Front Crawl, Haircut, Hammering, Hammer Throw, Handstand Pushups, Handstand Walking, Head Massage, Ice Dancing, Knitting, Long Jump, Mopping Floor, Parallel Bars, Playing Cello, Playing Daf, Playing Dhol, Playing Flute, Playing Sitar, Rafting, Shaving Beard, Shot put, Sky Diving, Soccer Penalty, Still Rings, Sumo Wrestling, Surfing, Table Tennis Shot, Typing, Uneven Bars, Wall Pushups, Writing On Board}. Fig. 2 shows a sample frame for each action class of UCF101.
Figure 3. Number of clips per action class. The distribution of clip durations is illustrated by the colors (legend bins: 0.0 - 2.0 sec, 2.0 - 5.0 sec, 5.0 - 10.0 sec, > 10.0 sec).
Clip Groups: The clips of one action class are divided into 25 groups which contain 4-7 clips each. The clips in one group share some common features, such as the background or actors.
The bar chart of Fig. 3 shows the number of clips in each class. The colors on each bar illustrate the durations of the different clips included in that class. The chart shown in Fig. 4 illustrates the average clip length (green) and total duration of clips (blue) for each action class.
The videos are downloaded from YouTube [2] and the irrelevant ones are manually removed. All clips have a fixed frame rate of 25 FPS and a fixed resolution of 320×240. The videos are saved in .avi files compressed using the DivX codec available in the k-lite package [1]. The audio is preserved for the clips of the new 51 actions. Table 1 summarizes the characteristics of the dataset.
Actions              101
Clips                13320
Groups per Action    25
Clips per Group      4-7
Mean Clip Length     7.21 sec
Total Duration       1600 mins
Min Clip Length      1.06 sec
Max Clip Length      71.04 sec
Frame Rate           25 fps
Resolution           320×240
Audio                Yes (51 actions)
Table 1. Summary of Characteristics of UCF101
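As a quick consistency check on the figures in Table 1, the total duration can be recomputed from the number of clips and the mean clip length. The sketch below uses only the numbers reported in the table; the variable names are ours, and small deviations are expected because the mean clip length is rounded to two decimals.

```python
# Figures copied from Table 1 of the text.
num_clips = 13320
mean_clip_len_sec = 7.21   # rounded value as reported
num_actions = 101
groups_per_action = 25

# Total duration implied by clip count and mean clip length.
total_sec = num_clips * mean_clip_len_sec
total_min = total_sec / 60.0
total_hours = total_min / 60.0

# Average number of clips per group implied by the table.
num_groups = num_actions * groups_per_action
mean_clips_per_group = num_clips / num_groups

print(f"total duration: {total_min:.1f} min ({total_hours:.1f} h)")
print(f"groups: {num_groups}, mean clips per group: {mean_clips_per_group:.2f}")
```

The result is close to the 1600 minutes listed in the table and to the 27 hours quoted in the abstract, and the implied average group size falls inside the stated range of 4-7 clips per group.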
Figure 4. Total time of videos for each class is illustrated using the blue bars. The average length of the clips for each action is depicted in green.
Naming Convention: The zipped file of the dataset (available at http://crcv.ucf.edu/data/UCF101.php) includes 101 folders, each containing the clips of one action class. The name of each clip has the following form:
v_X_gY_cZ.avi
where X, Y and Z represent the action class label, group and clip number, respectively. For instance, v_ApplyEyeMakeup_g03_c04.avi corresponds to clip 4 of group 3 of the action class ApplyEyeMakeup.
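The naming convention above can be parsed mechanically. The following is a minimal sketch, assuming the three fields are joined by underscores (v_X_gY_cZ.avi) and that the class label contains only letters; the names ClipName and parse_clip_name are hypothetical helpers introduced for illustration and are not part of the dataset distribution.

```python
import re
from typing import NamedTuple


class ClipName(NamedTuple):
    """Fields encoded in a clip file name (hypothetical helper type)."""
    action: str
    group: int
    clip: int


# Assumed pattern: v_<ActionLabel>_g<group>_c<clip>.avi
_CLIP_RE = re.compile(
    r"^v_(?P<action>[A-Za-z]+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi$"
)


def parse_clip_name(filename: str) -> ClipName:
    """Split a clip file name into action label, group number and clip number."""
    match = _CLIP_RE.match(filename)
    if match is None:
        raise ValueError(f"not a recognised clip name: {filename!r}")
    return ClipName(match["action"], int(match["group"]), int(match["clip"]))


example = parse_clip_name("v_ApplyEyeMakeup_g03_c04.avi")
print(example)
```

The group number recovered this way is what a group-wise cross validation split needs, since clips from the same group share background or actors.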
3. Experimental Results
We performed an experiment using the bag of words approach, which is widely accepted as a standard action recognition method, to provide baseline results on UCF101.
From each clip, we extracted Harris3D corners (using the implementation by [7]) and computed a 162-dimensional HOG/HOF descriptor for each. We clustered a randomly selected set of 100,000 space-time interest points (STIP) using k-means to build the codebook. The size of our codebook is k=4000, which has been shown to yield good results over a wide range of datasets. The descriptors were assigned to their closest video words using a nearest neighbor classifier, and each clip was represented by a 4000-dimensional histogram of its words. Utilizing a leave-one-group-out 25-fold cross validation scenario, an SVM was trained using the histogram vectors of the training folds. We employed a nonlinear multiclass SVM with histogram intersection kernel and 101 classes, each representing one action. For testing, a similar histogram representation for the query video was computed and classified using the trained SVM. This method yielded an overall accuracy of 44.5%. The confusion matrix for all 101 actions is shown in Fig. 5.
The accuracies for the predefined action types are: Sports (50.54%), Playing Musical Instrument (37.42%), Human-Object Interaction (38.52%), Body-Motion Only (36.26%), Human-Human Interaction (44.14%). Sports actions achieve the highest accuracy since performing sports typically requires distinctive motions, which makes the classification easier. Moreover, the background in sports clips is generally less cluttered compared to other action types. Unlike Sports actions, Human-Object Interaction clips typically have a highly cluttered background. Additionally, the informative motions typically occupy a small portion of the motions in the clips, which explains the low recognition accuracy of this action type.
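To make the histogram representation and the kernel concrete, the sketch below shows nearest-codeword assignment, the resulting bag of words histogram, and the histogram intersection between two histograms. It is an illustration on toy data, not the code used for the reported baseline: the function names are ours, the codebook here has 2 words instead of 4000, and the L1 normalisation of the histogram is an assumption that the text does not specify.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(descriptor: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codeword closest to the descriptor (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(codebook):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bow_histogram(descriptors: Sequence[Vector], codebook: Sequence[Vector]) -> List[float]:
    """Count how often each codeword is the nearest one, then L1-normalise.

    The normalisation step is an assumption made for this sketch.
    """
    counts = [0.0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total > 0 else counts


def histogram_intersection(h1: Vector, h2: Vector) -> float:
    """Histogram intersection kernel value: sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Toy example: a 2-word codebook and three 2-dimensional descriptors.
toy_codebook = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0]]
toy_hist = bow_histogram(toy_descriptors, toy_codebook)
print(toy_hist, histogram_intersection(toy_hist, [0.5, 0.5]))
```

In a full pipeline, the matrix of pairwise intersection values between training histograms would be handed to an SVM implementation that accepts a precomputed kernel.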
Dataset           Number of Actions   Clips   Background   Camera Motion   Release Year   Resource
KTH [11]          6                   600     Static       Slight          2004           Actor Staged
Weizmann [3]      9                   81      Static       No              2005           Actor Staged
UCF Sports [10]   9                   182     Dynamic      Yes             2009           TV, Movies
IXMAS [12]        11                  165     Static       No              2006           Actor Staged
UCF11 [6]         11                  1168    Dynamic      Yes             2009           YouTube
HOHA [7]          12                  2517    Dynamic      Yes             2009           Movies
Olympic [8]       16                  800     Dynamic      Yes             2010           YouTube
UCF50 [9]         50                  6681    Dynamic      Yes             2010           YouTube
HMDB51 [5]        51                  6766    Dynamic      Yes             2011           Movies, YouTube, Web
UCF101            101                 13320   Dynamic      Yes             2012           YouTube
Table 2. Summary of Major Action Recognition Datasets
We recommend a 25-fold cross validation experimental setup using all the videos in the dataset, so that tests reported on UCF101 remain consistent; the baseline results provided in this section were computed using the same scenario.
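One way to realise this setup is sketched below, under the assumption that fold k holds out the clips of group k from every action class and trains on the remaining groups. The function name leave_one_group_out and the toy group labels are hypothetical; with the real dataset the group number of each clip would come from its file name.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_group_out(
    groups: Sequence[int],
) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices) for each distinct group.

    groups[i] is the group number of clip i. With 25 groups per action this
    produces 25 folds, and no group is ever split between train and test.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Toy example with three groups instead of 25.
toy_groups = [1, 1, 2, 3, 3, 3]
toy_folds = list(leave_one_group_out(toy_groups))
for held_out, train, test in toy_folds:
    print(held_out, train, test)
```

Splitting by group rather than by clip matters because clips within a group share background or actors, so a clip-level split would let near-duplicates of test clips appear in the training set.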
4. Related Datasets
UCF Sports, UCF11, UCF50 and UCF101 are the four action datasets compiled by UCF, in chronological order; each one includes its precursor. We made two minor modifications in the portion of UCF101 which includes UCF50 videos: the number of groups is fixed to 25 for all the actions, and each group includes up to 7 clips. Table 2 shows a list of existing action recognition datasets with detailed characteristics of each. Note that UCF101 is considerably larger than the rest.
5. Conclusion
We introduced UCF101, which is the most challenging dataset for action recognition compared to the existing ones. It includes 101 action classes and over 13k clips, which makes it considerably larger than other datasets. UCF101 is composed of unconstrained videos downloaded from YouTube which feature challenges such as poor lighting, cluttered background and severe camera motion. We provided baseline action recognition results on this new dataset using a standard bag of words method with an overall accuracy of 44.5%.
References
[1] K-lite codec package. http://codecguide.com/.
[2] YouTube. http://www.youtube.com/.
[3] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes, 2005. International Conference on Computer Vision (ICCV).
[4] G. Johansson, S. Bergstrom, and W. Epstein. Perceiving events and objects, 1994. Lawrence Erlbaum Associates.
[5] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition, 2011. International Conference on Computer Vision (ICCV).
[6] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos in the wild, 2009. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] M. Marszałek, I. Laptev, and C. Schmid. Actions in context, 2009. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[8] J. Niebles, C. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification, 2010. European Conference on Computer Vision (ECCV).
[9] K. Reddy and M. Shah. Recognizing 50 human action categories of web videos, 2012. Machine Vision and Applications Journal (MVAP).
[10] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatiotemporal maximum average correlation height filter for action recognition, 2008. IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach, 2004. International Conference on Pattern Recognition (ICPR).
[12] D. Weinland, E. Boyer, and R. Ronfard. Action recognition from arbitrary views using 3D exemplars, 2007. International Conference on Computer Vision (ICCV).
Figure 5. Confusion table of baseline action recognition results using the bag of words approach on UCF101. The drawn lines separate different types of actions; 1-50: Sports, 51-60: Playing Musical Instrument, 61-80: Human-Object Interaction, 81-96: Body-Motion Only, 97-101: Human-Human Interaction.
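The index ranges in the caption of Fig. 5 define a simple lookup from a row or column position in the confusion table to its action type. The sketch below encodes exactly those ranges; the function name action_type is a hypothetical helper, and the index refers to the ordering used in Fig. 5, which groups classes by type rather than alphabetically.

```python
from bisect import bisect_left

# Upper bound (inclusive) of each index range given in the Fig. 5 caption.
_UPPER_BOUNDS = [50, 60, 80, 96, 101]
_TYPE_NAMES = [
    "Sports",
    "Playing Musical Instrument",
    "Human-Object Interaction",
    "Body-Motion Only",
    "Human-Human Interaction",
]


def action_type(index: int) -> str:
    """Return the action type for a 1-based class index in the Fig. 5 ordering."""
    if not 1 <= index <= 101:
        raise ValueError(f"class index out of range: {index}")
    return _TYPE_NAMES[bisect_left(_UPPER_BOUNDS, index)]


# Number of classes per type implied by the ranges.
type_sizes = {name: 0 for name in _TYPE_NAMES}
for i in range(1, 102):
    type_sizes[action_type(i)] += 1
print(type_sizes)
```

The implied type sizes are 50, 10, 20, 16 and 5 classes, which sum to the 101 classes of the dataset.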