Speech2Action: Cross-modal Supervision for Action...

Speech2Action Cross-modal Supervision for Action Recognition

Arsha Nagrani12 Chen Sun2 David Ross2

Rahul Sukthankar2 Cordelia Schmid2 Andrew Zisserman13

1VGG Oxford 2Google Research 3DeepMind

httpswwwrobotsoxacuk˜vggresearchspeech2action

Caption Hello itrsquos meSpeech2Actionclassifier

[answers] phone Hello itrsquos me

[answers] phone Thanks for calling so soon [answers] phone Hello Dad are you still there

action dialogue

action dialogue action dialogue

Unlabelled videos

She knows hes right Janes cell RINGS She lets it ring

again then answers it

JANE(into phone)

Hello itrsquos me

Movie screenplays

Weak label [answer] phone

Figure 1 Weakly Supervised Learning of Actions from Speech Alone The co-occurrence of speech and scene descriptions in moviescreenplays (text) is used to learn a Speech2Action model that predicts actions from transcribed speech alone Weak labels for visualactions can then be obtained by applying this model to the speech in a large unlabelled set of movies

Abstract

Is it possible to guess human action from dialoguealone In this work we investigate the link between spokenwords and actions in movies We note that movie screen-plays describe actions as well as contain the speech ofcharacters and hence can be used to learn this correla-tion with no additional supervision We train a BERT-based Speech2Action classifier on over a thousandmovie screenplays to predict action labels from transcribedspeech segments

We then apply this model to the speech segments of alarge unlabelled movie corpus (188M speech segmentsfrom 288K movies) Using the predictions of this modelwe obtain weak action labels for over 800K video clips Bytraining on these video clips we demonstrate superior ac-tion recognition performance on standard action recogni-tion benchmarks without using a single manually labelledaction example

1 Introduction

Often you can get a sense of human activity in a movieby listening to the dialogue alone For example the sen-tence Look at that spot over there is an indication thatsomebody is pointing at something Similarly the wordsHello thanks for calling is a good indication that some-

body is speaking on the phone Could this be a valuablesource of information for learning good action recognitionmodels

Obtaining large scale human labelled video datasets totrain models for visual action recognition is a notoriouslychallenging task While large datasets such as Kinetics [20]or Moments in Time [30] consisting of individual shortclips (eg 10s) are now available these datasets come atformidable human cost and effort Furthermore many suchdatasets suffer from heavily skewed distributions with longtails ndash ie it is difficult to obtain manual labels for rare orinfrequent actions [15]

Recently a number of works have creatively identifiedcertain domains of videos such as narrated instructionalvideos [28 39 52] and lifestyle vlogs [12 18] that are avail-able in huge numbers (eg on YouTube) and often containnarration with the explicit intention of explaining the visualcontent on screen In these video domains there is a directlink between the action being performed and the speechaccompanying the video ndash though this link and the visualsupervision it provides can be quite weak and lsquonoisyrsquo as thespeech may refer to previous or forthcoming visual eventsor be about something else entirely [28]

In this paper we explore a complementary link betweenspeech and actions in the more general domain of moviesand TV shows (not restricted to instructional videos andvlogs) We ask is it possible given only a speech sentenceto predict whether an action is happening and if so what

1

the action is While it appears that in some cases the speechis correlated with action ndash lsquoRaise your glasses to rsquo in themore general domain of movies and TV shows it is morelikely that the speech is completely uncorrelated with theaction ndash lsquoHow is your day goingrsquo Hence in this workwe explicitly learn to identify when the speech is discrimi-native While the supervision we obtain from the speechndashaction correlation is still noisy we show that at scale it canprovide sufficient weak supervision to train visual classi-fiers (see Fig 1)

Luckily we have a large amount of literary content atour disposal to learn this correlation between speech andactions Screenplays can be found for hundreds of moviesand TV shows and contain rich descriptions of the identi-ties of people their actions and interactions with one an-other and their dialogue Early work has attempted to alignthese screenplays to the videos themselves and use thatas a source of weak supervision [2 9 23 26] Howeverthis is challenging due to the lack of explicit correspon-dence between scene elements in video and their textualdescriptions in screenplays [2] and notwithstanding align-ment quality is also fundamentally limited in scale to theamount of aligned movie screenplays available Instead welearn from unaligned movie screenplays We first learn thecorrelation between speech and actions from written mate-rial alone and use this to train a Speech2Action classi-fier This classifier is then applied to the speech in an un-labelled unaligned set of videos to obtain visual samplescorresponding to the actions confidently predicted from thespeech (Fig 1) In this manner the correlations can provideus with an effectively infinite source of weak training datasince the audio is freely available with movies

Concretely we make the following four contributions(i) We train a Speech2Action model from literaryscreenplays and show that it is possible to predict certainactions from transcribed speech alone without the need forany manual labelling (ii) We apply the Speech2Actionmodel to a large unlabelled corpus of videos to obtainweak labels for video clips from the speech alone (iii) Wedemonstrate that an action classifier trained with these weaklabels achieves state of the art results for action classifica-tion when fine-tuned on standard benchmarks compared toother weakly superviseddomain transfer methods (iv) Fi-nally and more interestingly we evaluate the action clas-sifier trained only on these weak labels with no fine-tuningon the mid and tail classes from the AVA dataset [15] inthe zero-shot and few-shot setting and show a large boostover fully supervised performance for some classes withoutusing a single manually labelled example

2 Related WorksAligning Screenplays to Movies A number of works haveexplored the use of screenplays to learn and automatically

annotate character identity in TV series [6 10 31 36 40]Learning human actions from screenplays has also been at-tempted [2 9 23 26 27] Crucially however all theseworks rely on aligning these screenplays to the actual videosthemselves often using the speech (as subtitles) to providecorrespondences However as noted by [2] obtaining su-pervision for actions in this manner is challenging due tothe lack of explicit correspondence between scene elementsin video and their textual descriptions in screenplays

Apart from the imprecise temporal localization inferredfrom subtitles correspondences a major limitation is thatthis method is not scalable to all movies and TV showssince screenplays with stage directions are simply notavailable at the same order of magnitude Hence previousworks have been limited to a small scale no more than tensof movies or a season of a TV series [2 9 23 26 27] Asimilar argument can be applied to works that align booksto movies [41 53] In contrast we propose a method thatcan exploit the richness of information in a modest numberof screenplays and then be applied to a virtually limitlessset of edited video material with no alignment or manualannotation required

Supervision for Action Recognition The benefits oflearning from large scale supervised video datasets for thetask of action recognition are well known with the intro-duction of datasets like Kinetics [20] spurring the develop-ment of new network architectures yielding impressive per-formance gains eg [4 11 42 44 45 48] However asdescribed in the introduction such datasets come with anexorbitant labelling cost Some work has attempted to re-duce this labeling effort through heuristics [51] (although ahuman annotator is required to clean up the final labels) orby procuring weak labels in the form of accompanying metadata such as hashtags [13]

There has also been a recent growing interest in usingcross-modal supervision from the audio streams readilyavailable with videos [1 21 32 33 50] Such methodshowever focus on non-speech audio eg lsquoguitar playingrsquothe lsquothudrsquo of a bouncing ball or the lsquocrashrsquo of waves atthe seaside rather the transcribed speech As discussed inthe introduction transcribed speech is used only in certainnarrow domains eg instruction videos [28 39 52] andlifestyle vlogs [12 18] while in contrast to these works wefocus on the domain of movies and TV shows (where thelink between speech and actions is less explicit) Furthersuch methods use most or all the speech accompanying avideo to learn a better overall visual embedding whereaswe note that often the speech is completely uninformativeof the action Hence we first learn the correlation betweenspeech and actions from written material and then applythis knowledge to an unlabelled set of videos to obtainvideo clips that can be used directly for training

3 Speech2Action ModelIn this section we describe the steps in data prepa-

ration data mining and learning required to train theSpeech2Action classifier from a large scale dataset ofscreenplays We then assess its performance in predictingvisual actions from transcribed speech segments

31 The IMSDb Dataset

Movie screenplays are a rich source of data that con-tain both stage directions (lsquoAndrew walked over to open thedoor) and the dialogues spoken by the characters (lsquoPleasecome inrsquo) Since stage directions often contain described ac-tions we use the co-occurrence of dialogue and stage direc-tions in screenplays to learn the relationship between lsquoac-tions and dialogue (see Fig 1) In this work we use a cor-pus of screenplays extracted from IMSDb (wwwimsdbcom) In order to get a wide variety of different actions(lsquopushrsquo and lsquokickrsquo as well as lsquokissrsquo and lsquohugrsquo) we usescreenplays covering a range of different genres1 In totalour dataset consists of 1070 movie screenplays (statisticsof the dataset can be seen in Table 1) We henceforth referto this dataset as the IMSDb datasetScreenplay Parsing While screenplays (generally) followa standardized format for their parts (eg stage directiondialogue location timing information etc) they can bechallenging to parse due to discrepancies in layout and for-mat We follow the grammar created by Winer et al [46]which is based on lsquoThe Hollywood Standardrsquo [34] to parsethe scripts and separate out various screenplay elementsThe grammar provided by [46] parses scripts into the fol-lowing four different elements (1) Shot Headings (2) StageDirections (which contain mention of actions) (3) Dialogueand (4) Transitions More details are provided in Sec A1of the Appendix

In this work we extract only (2) Stage Directions and(3) Dialogue We extract over 500K stage directions andover 500K dialogue utterances (see Table 1) It is impor-tant to note that since screenplay parsing is done using anautomatic method and sometimes hand-typed screenplaysfollow completely non-standard formats this extraction isnot perfect A quick manual inspection of 100 randomlyextracted dialogues shows that around 85 of these are ac-tually dialogue with the rest being stage directions that havebeen wrongly labelled as dialogueVerb Mining the Stage Directions Not all actions willbe correlated with speech ndash eg actions like lsquositting andlsquostanding are difficult to distinguish based on speech alonesince they occur commonly with all types of speech Hence

1Action Adventure Animation Biography Comedy Crime DramaFamily Fantasy Film-Noir History Horror Music Musical Mystery Ro-mance Sci-Fi Short Sport Thriller War Western

our first endeavour is to automatically determine verbs ren-dered lsquodiscriminativersquo by speech alone For this we usethe IMSDb dataset described above We first take all thestage directions in the dataset and break up each sentenceinto clean word tokens (devoid of punctuation) We thendetermine the part of speech (PoS) tag for each word us-ing the NLTK toolkit [25] and obtain a list of all the verbspresent Verbs occurring fewer than 50 times (includesmany spelling mistakes) or those occurring too frequentlyie the top 100 most frequent verbs (these are stop wordslike lsquobersquo etc) are removed For each verb we then grouptogether all the conjugations and word forms for a particularword stem (eg the stem run can appear in many differentforms ndash running ran runs etc) using the manually createdverb conjugations list from the UPenn XTag project2 Allsuch verb classes are then used in training a BERT-basedspeech to action classifier described next

32 BERT-based Speech Classifier

Each stage direction is then parsed for verbs belonging tothe verb classes identified above We obtain paired speech-action data using proximity in the movie screenplays as aclue Hence the nearest speech segment to the stage di-rection (as illustrated in Fig 1) is assigned a label for ev-ery verb in the stage direction (more examples in the Ap-pendix Fig 7) This gives us a dataset of speech sentencesmatched to verb labels As expected this is a very noisydataset Often the speech has no correlation with the verbclass it is assigned to and the same speech segment can beassigned to many different verb classes To learn the corre-lation between speech and action we train a classifier with850 movies and use the remaining ones for validation Theclassifier used is a pretrained BERT [8] model with an addi-tional classification layer finetuned on the dataset of speechpaired with weak lsquoactionrsquo labels Exact model details aredescribed belowImplementation Details The model used is BERT-LargeCased with Whole-Word Masking (L=24 H=1024 A=16Total Parameters=340M) [8] pretrained only on Englishdata (BooksCorpus (800M words [53]) and the Wikipediacorpus (2500M words)) since the IMSDb dataset consistsonly of movie screenplays in English3 We use Word-Piece embeddings [47] with a 30 000 token vocabularyThe first token of every sequence is always a special clas-sification token ([CLS]) We use the final hidden vectorC isin RH corresponding to the first input token ([CLS])as the aggregate representation The only new parame-ters introduced during fine-tuning are classification layerweights W isin RKtimesH where K is the number of classesWe use the standard cross-entropy loss with C and W

2httpwwwcisupennedu˜xtag3The model can be found here httpsgithubcom

google-researchbert

movies scene descriptions speech segs sentences words unique words genres

1070 539827 595227 2570993 21364357 590959 22Table 1 Statistics of the IMSDb dataset of movie screenplays This dataset is used to learn the correlation between speech and verbsWe use 850 screenplays for training and 220 for validation Statistics for sentences and words are from the entire text of the screenplays

Hello itrsquos me One more kiss To usMay I have the number for Dr George Shannan Give me a kiss Raise your glasses to Charlie

phone Honey I asked you not to call unless what why kiss Good night my darling drink Heres a toasthey itrsquos me I love you my darling You want some waterHello itrsquos me Noone had ever kissed me there before Drink deep and liveHello Goodnight angel my sweet boy Drink up its party timeShes a beautiful dancer So well drop Rudy off at the bus Officer Van Dorn is right down that hallWaddaya say you wanna dance Ill drive her OK Print that one

dance Come on Ill take a break and well all dance drive just parking it out of the way point the Metroplitan Museum of Art is right thereLadies and Gentlemen the first dance all you have to do is drop me off at the bank Over thereExcuse me would you care for this dance Wait down the road And herHattie do you still dance He drove around for a long long time driving The one with the black spot

Figure 2 Examples of the top ranked speech samples for six verb categories Each block shows the action verb on the left and thespeech samples on the right All speech segments are from the validation set of the IMSDb dataset of movie screenplays

ie log(softmax(WTC)) We use a batch size of 32 andfinetune the model end-to-end on the IMSDb dataset for100000 iterations using the Adam solver with a learningrate of 5times 105Results We evaluate the performance of our model on the220 movie screenplays in the val set We plot the precision-recall curves using the softmax scores obtained from theSpeech2Action model (Fig 6 in the Appendix) Onlythose verbs that achieve an average precision (AP) higherthan 001 are inferred to be correlated with speech Thehighest performing verb classes are lsquophonersquo lsquoopenrsquo andlsquorunrsquo whereas verb classes like lsquofishingrsquo and lsquodigrsquo achieve avery low average precision We finally conclude that there isa strong correlation for 18 verb classes4 Qualitative exam-ples of the most confident predictions (using softmax scoreas a measure of confidence) for 6 verb classes can be seenin Fig 2 We note here that we have learnt the correlationbetween action verb and speech from the movie screenplaysusing a purely data-driven method The key assumption isthat if there is a consistent trend of a verb appearing in thescreenplays before or after a speech segment and our modelis able to exploit this trend to minimise a classification ob-jective we infer that the speech is correlated with the actionverb Because the evaluation is performed purely on thebasis of the proximity of speech to verb class in the stagedirection of the movie screenplay it is not a perfect groundtruth indication of whether an action will actually be per-formed in a video (which is impossible to say only fromthe movie scripts) We use the stage directions in this caseas pseudo ground truth ie if the stage direction containsan action and the actor then says a particular sentence weinfer that these two must be related As a sanity check we

4The verb classes are lsquoopenrsquo lsquophonersquo lsquokissrsquo lsquohugrsquo lsquopushrsquo lsquopointrsquolsquodancersquo lsquodrinkrsquo lsquorunrsquo lsquocountrsquo lsquocookrsquo lsquoshootrsquo lsquodriversquo lsquoenterrsquo lsquofallrsquo lsquofol-lowrsquo lsquohitrsquo lsquoeatrsquo

also manually annotate some videos in order to better assessthe performance of the Speech2Action model This isdescribed in Sec 423

4 Mining Videos for Action Recognition

Now that we have learned the Speech2Action modelto map from transcribed speech to actions (from text alone)in this section we demonstrate how this can be applied tovideo We use the model to automatically mine video ex-amples from large unlabelled corpora (the corpus is de-scribed in Sec 41) and assign them with weak labelsfrom the Speech2Action model prediction Armed withthis weakly labelled data we then train models directlyfor the downstream task of visual action recognition De-tailed training and evaluation protocols for the mining aredescribed in the following sections

41 Unlabelled Data

In this work we apply the Speech2Action model toa large internal corpus of movies and TV shows The cor-pus consists of 222 855 movies and TV show episodes Forthese videos we use the closed captions (note that this canbe obtained from the audio track directly using automaticspeech recognition) The total number of closed captionsfor this corpus is 188 210 008 which after dividing intosentences gives us a total of 390 791 653 (almost 400M)sentences While we use this corpus in our work we wouldlike to stress here that there is no correlation between thetext data used to train the Speech2Action model andthis unlabelled corpus (other than both belonging to themovie domain) and such a model can be applied to anyother corpus of unlabelled edited film material

42 Obtaining Weak Labels

In this section we describe how we obtain weak actionlabels for short clips from the speech alone We do this intwo ways (i) using the Speech2Action model and (ii)using a simple keyword spotting baseline described below

421 Using Speech2Action

The Speech2Action model is applied to a single sen-tence of speech and the prediction is used as a weak label ifthe confidence (softmax score) is above a certain thresholdThe threshold is obtained by taking the confidence value ata precision of 03 on the IMSDb validation set with somemanual adjustments for the classes of lsquophonersquo lsquorunrsquo andlsquoopenrsquo (since these classes have a much higher recall weincrease the threshold in order to prevent a huge imbalanceof retrieved samples) More details are provided in the Ap-pendix Sec A2 We then extract the visual frames for a 10second clip centered around the midpoint of the timeframespanned by the caption and assign the Speech2Actionlabel as the weak label for the clip Ultimately we suc-cessfully end up mining 837 334 video clips for 18 actionclasses While this is a low yield we still end up with a largenumber of mined clips greater than the manually labelledKinetics dataset [20] (600K)

We also discover that the verb classes that have high cor-relation with speech in the IMSDb dataset tend to be infre-quent or rare actions in other datasets [15] ndash as shown inFig 3 we obtain two orders of magnitude more data forcertain classes in the AVA training set [15] Qualitative ex-amples of mined video clips with action labels can be seenin Fig 4 Note how we are able to retrieve clips with a widevariety in background and actor simply from the speechalone Refer to Fig 10 in the Appendix for more examplesshowing diversity in objects and viewpoint

422 Using a Keyword Spotting Baseline

In order to validate the efficacy of our Speech2Actionmodel trained on movie screenplays we also compare to asimple keyword spotting baseline This involves searchingfor the action verb in the speech directly ndash a speech segmentlike lsquoWill you eat nowrsquo is directly assigned the label lsquoeatrsquoThis itself is a very powerful baseline eg speech segmentssuch as lsquoWill you dance with mersquo are strongly indicative ofthe action lsquodancersquo To implement this baseline we searchfor the presence of the action verb (or its conjugations) inthe speech segment directly and if the verb is present in thespeech we assign the action label to the video clip directlyThe fallacy of this method is that there is no distinctionbetween the different semantic meanings of a verb eg thespeech segment lsquoYoursquove missed the point entirelyrsquo will beweakly labelled with the verb lsquopointrsquo using this baseline

Figure 3 Distribution of training clips mined usingSpeech2Action We compare the distribution of minedclips to the number of samples in the AVA training set Althoughthe mined clips are noisy we are able to obtain far more in somecases up to two orders of magnitude more training data (note thelog scale in the x-axis)

dance phone kiss drive eat drink run point hit shoot

42 68 18 41 27 51 83 52 18 27

Table 2 Number of true positives for 100 randomly re-trieved samples for 10 classes These estimates are obtainedthrough manual inspection of video clips that are labelled withSpeech2Action While the true positive rate for some classesis low the other samples still contain valuable information for theclassifier For example although there are only 18 true samplesof lsquokissrsquo many of the other videos have two people with their lipsvery close together or even if they are not lsquoeatingrsquo strictly manytimes they are holding food in their hands

which is indicative of a different semantic meaning to thephysical action of lsquopointingrsquo Hence as we show in theresults this baseline performs poorly compared to ourSpeech2Action mining method (Tables 4 and 3) Moreexamples of speech labelled using this keyword spottingbaseline can be seen in Table 5 in the Appendix

423 Manual Evaluation of Speech2Action

We now assess the performance of Speech2Action ap-plied to videos Given a speech segment we check whethera prediction made by the model on the speech translates tothe action being performed visually in the frames alignedto the speech To assess this we do a manual inspection ofa random set of 100 retrieved video clips for 10 of the verbclasses and report the true positive rate (number of clips forwhich the action is visible) in Table 2 We find that a sur-prising number of samples actually contain the action dur-ing the time frame of 10 seconds with some classes noisierthan others The high purity of the classes lsquorunrsquo and lsquophonersquocan be explained by the higher thresholds used for mining

why are you green and dancing

and nandita lets see your sita dance

how can you not dance

you dance yank

love you gave me what i wanted

i wish we could stay like this forever

give me a big kissthen you must kiss me now

these drinks are strong

ah i am the one sipping the champagne now

after two belvedere martinis straight up with twists

thatrsquos why imsitting here day drinking in the corner

see that up there look at that right there

and that one there is that it over therego go go run faster baby donrsquot move hey chase chase

hell of a right hook

you hit like a b you almost hit me m

donrsquot hit

DANCE

HIT

this chicken is very tasty

have you ever had szechwan cuisine before

this food is so good

are ronnie and nancy on the cover your menu

EAT

POINT

DRINK

RUN

please leave a message after the tone

pick up olegi am trying brother from other phone

yes i need jeff on his secure line

PHONE

KISS

Figure 4 Examples of clips mined automatically using the Speech2Action model applied to speech alone for 8 AVA classes Weshow only a single frame from each video Note the diversity in background actor and view point We show false positives for eat phoneand dance (last in each row enclosed in a red box) Expletives are censored More examples are provided in the Appendix

as explained in Sec 421 Common sources of false posi-tives are actions performed off screen or actions performedat a temporal offset (either much before or much after) thespeech segment We note that at no point do we ever ac-tually use any of the manual labels for training these arepurely for evaluation and as a sanity check

5 Action ClassificationNow that we have described our method to obtain weakly

labelled training data we train a video classifier with theS3D-G [48] backbone on these noisy samples for the taskof action recognition We first detail the training and testingprotocols and then describe the datasets used in this work

51 Evaluation Protocol

We evaluate our video classifier for the task of actionclassification in the following two waysFirst we follow the typical procedure adopted in the videounderstanding literature [4] pre-training on a large cor-pus of videos weakly labelled using our Speech2Actionmodel followed by fine-tuning on the training split of a la-beled target dataset (test bed) After training we evaluatethe performance on the test set of the target dataset In this

work we use HMDB-51 [22] and compare to other state ofthe art methods on this dataset We also provide results forthe UCF101 dataset [37] in Sec C of the AppendixSecond and perhaps more interestingly we apply ourmethod by training a video classifier on the mined videoclips for some action classes and evaluating it directly onthe test samples of rare action classes in the target dataset(in this case we use the AVA dataset [15]) Note At thispoint we also manually verified that there is no overlapbetween the movies in the IMSDb dataset and the AVAdataset (not surprising since AVA movies are older andmore obscure these are movies that are freely available onYouTube) Here not a single manually labelled training ex-ample is used since there is no finetuning (we henceforthrefer to this as zero-shot5) We also report performancefor the few-shot learning scenario where we fine-tune ourmodel on a small number of labelled examples We notethat in this case we can only evaluate on the classes thatdirectly overlap with the verb classes in the IMSDb dataset

5In order to avoid confusion with the strict meaning of this term weclarify that in this work we use it to refer to the case where not a singlemanually labelled example is available for a particular class We do how-ever train on multiple weakly labelled examples

52 Datasets and Experimental Details

HMDB51 HMDB51 [22] contains 6766 realistic and var-ied video clips from 51 action classes Evaluation is per-formed using average classification accuracy over threetraintest splits from [17] each with 3570 train and 1530test videosAVA The AVA dataset [15] is collected by exhaustivelymanually annotating videos and exhibits a strong imbalancein the number of examples between the common and rareclasses Eg a common action like lsquostandrsquo has 160K train-ing and 43K test examples compared to lsquodriversquo (118K trainand 561 test) and lsquopointrsquo (only 96 train and 32 test) As aresult methods relying on full supervision struggle on thecategories in the middle and the end of the tail We evaluateon the 14 AVA classes that overlap with the classes presentin the IMDSDb dataset (all from the middle and tail) Whilethe dataset is originally a detection dataset we repurposeit simply for the task of action classification by assigningeach frame the union of labels from all bounding box an-notations We then train and test on samples from these 14action classes reporting per-class average precision (AP)Implementation Details We train the S3D with gating(S3D-G) [48] model as our visual classifier Following [48]we densely sample 64 frames from a video resize inputframes to 256 times 256 and then take random crops of size224 times 224 during training During evaluation we use allframes and take 224 times 224 center crops from the resizedframes Our models are implemented with TensorFlow andoptimized with a vanilla synchronous SGD algorithm withmomentum of 09 For models trained from scratch wetrain for 150K iterations with a learning rate schedule of102 103 and 104 dropping after 80K and 100K iterationsand for finetuning we train for 60K iterations using a learn-ing rate of 102Loss functions for training We try both the softmaxcross-entropy and per-class sigmoid loss and find that theperformance was relatively stable with both choices

53 Results

HMDB51 The results on HMDB51 can be seen in Table 3Training on videos labelled with Speech2Actions leadsto a significant 17 improvement over from-scratch train-ing For reference we also compare to other self-supervisedand weakly supervised works (note that these methods dif-fer both in architecture and training objective) We showa 14 improvement over previous self-supervised worksthat use only video frames (no other modalities) We alsocompare to Korbar et al [21] who pretrain using audio andvideo synchronisation on AudioSet DisInit [14] which dis-tills knowledge from ImageNet into Kinetics videos andsimply pretraining on ImageNet and then inflating 2D con-volutions to our S3D-G model [20] We improve over theseworks by 3-4 ndash which is impressive given that the latter

Method Architecture Pre-training Acc

ShuffleampLearn [29]983183 S3D-G (RGB) UCF101dagger [37] 358OPN [24] VGG-M-2048 UCF101dagger [37] 238ClipOrder [49] R(2+1)D UCF101dagger [37] 309Wang et al [43] C3D Kineticsdagger [37] 3343DRotNet [19]983183 S3D-G (RGB) Kineticsdagger 400DPC [16] 3DResNet18 Kineticsdagger 357CBT [38] S3D-G (RGB) Kineticsdagger 446

DisInit (RGB) [14] R(2+1)D-18 [42] Kineticslowastlowast 548Korbar et al [21] I3D (RGB) Kineticsdagger 530

- S3D-G (RGB) Scratch 412Ours S3D-G (RGB) KSB-mined 460Ours S3D-G (RGB) S2A-mined 581

Supervised pretraining S3D-G (RGB) ImageNet 547Supervised pretraining S3D-G (RGB) Kinetics 723

Table 3 Action classification results on HMDB51 Pre-trainingon videos labelled with Speech2Action leads to a 17 im-provement over training from scratch and also outperforms previ-ous self-supervised and weakly supervised works KSB-minedvideo clips mined using the keyword spotting baseline S2A-mined video clips mined using the Speech2Action modeldaggervideos without labels videos with labels distilled from Ima-geNet When comparing to [21] we report the number achievedby their I3D (RGB only) model which is the closest to our archi-tecture For 983183 we report the reimplementations by [38] using theS3D-G model (same as ours) For the rest we report performancedirectly from the original papers

two methods rely on access to a large-scale manually la-belled image dataset [7] whereas ours relies only on 1000unlabelled movie scripts Another point of interest (and per-haps an unavoidable side-effect of this stream of self- andweak-supervision) is that while all these previous methodsdo not use labels they still pretrain on the Kinetics datawhich has been carefully curated to cover a wide diversity ofover 600 different actions In contrast we mine our trainingdata directly from movies without the need for any manuallabelling or careful curation and our pretraining data wasmined for only 18 classesAVA-scratch The results on AVA for models trained fromscratch with no pretraining can be seen in Table 4 (top 4rows) We compare the following training with the AVAtraining examples (Table 4 top row) training only with ourmined examples and training jointly with both For 8 out of14 classes we exceed fully supervised performance withouta single AVA training example in some cases (lsquodriversquo andlsquophonersquo) almost by 20AVA-finetuned We also show results for pre-training onSpeech2Action mined clips first and then fine-tuningon a gradually increasing number of AVA labelled trainingsamples per class (Table 4 bottom 4 rows) Here we keepall the weights from the fine-tuning including the classifica-tion layer weights for initialisation and fine-tune only fora single epoch With 50 training samples per class we ex-ceed fully supervised performance for all classes (except for

Data Per-Class APdrive phone kiss dance eat drink run point open hit shoot push hug enter

AVA (fully supervised) 063 054 022 046 067 027 066 002 049 062 008 009 029 014

KS-baseline dagger 067 020 012 053 067 018 037 000 033 047 005 003 010 002S2A-mined (zero-shot) 083 079 013 055 068 030 063 004 052 054 018 004 007 004S2A-mined + AVA 084 083 018 056 075 040 074 005 056 064 023 007 017 004

AVA (few-shot)-20 082 083 022 055 069 033 064 004 051 059 020 006 019 013AVA (few-shot)-50 082 085 026 056 070 037 069 004 052 065 021 006 019 015AVA (few-shot)-100 084 086 030 058 071 039 075 005 058 073 025 013 027 015

AVA (all) 086 089 034 058 078 042 075 003 065 072 026 013 036 016Table 4 Per-class average precision for 14 AVA mid and tail classes These actions occur rarely and hence are harder to get manualsupervision for For 8 of the 14 classes we exceed fully supervised performance without a single manually labelled training example(highlighted in pink best viewed in colour) S2A-mined Video clips mined using Speech2Action dagger Keyword spotting baselineFirst 4 rows models are trained from scratch Last 4 rows we pre-train on video clips mined using Speech2Action

after you stay close behind me now

just follow my lead follow me quick

FOLLOWtwo quarters three dimes one nickel two pennies

thirty six thousand four hundred five hundred

20 dollar four centstwenty four thousand four hundred

COUNT

Figure 5 Examples of clips mined for more abstract actions These are actions that are not present in standard datasets like HMDB51or AVA but are quite well correlated with speech Our method is able to automatically mine clips weakly labelled with these actions fromunlabelled data

lsquohugrsquo and lsquopushrsquo) compared to training from scratch Theworst performance is for the class lsquohugrsquo ndash lsquohugrsquo and lsquokissrsquoare often confused as the speech in both cases tends to besimilar ndash rsquoI love yoursquo A quick manual inspection showsthat most of the clips are wrongly labelled as lsquokissrsquo whichis why we are only able to mine very few video clips for thisclass For completeness we also pretrain a model with theS2A mined clips (only 14 classes) and then finetune on AVAfor all 60 classes used for evaluation and get a 40 overallclassification acc vs 38 with training on AVA aloneMining Technique We also train on clips mined usingthe keyword spotting baseline (Table 4) For some classesthis baseline itself exceeds fully supervised performanceOur Speech2Action labelling beats this baseline for allclasses indeed the baseline does poorly for classes likelsquopointrsquo and lsquoopenrsquo ndash verbs which have many semantic mean-ings demonstrating that the semantic information learntfrom the IMSDb dataset is valuable However we note herethat it is difficult to measure performance quantitatively forthe class lsquopointrsquo due to idiosyncrasies in the AVA test set(wrong ground truth labels for very few test samples) andhence we show qualitative examples of mined clips in Fig4 We note that the baseline comes very close for lsquodancersquoand lsquoeatrsquo demonstrating that simple keyword matching onspeech can retrieve good training data for these actionsAbstract Actions By gathering data directly from thestage directions in movie screenplays our action labels are

post-defined (as in [12]) This is unlike the majority ofthe existing human action datasets that use pre-defined la-bels [3 15 30 35] Hence we also manage to mine exam-ples for some unusual or abstract actions which are quitewell correlated with speech such as lsquocountrsquo and lsquofollowrsquoWhile these are not present in standard action recognitiondatasets such as HMDB51 or AVA and hence cannot beevaluated numerically we show some qualitative examplesof these mined videos in Fig 5

6 ConclusionWe provide a new data-driven approach to obtain weak

labels for action recognition using speech alone With onlya thousand unaligned screenplays as a starting point weobtain weak labels automatically for a number of rare ac-tion classes However there is a plethora of literary ma-terial available online including plays and books and ex-ploiting these sources of text may allow us to extend ourmethod to predict other action classes including compositeactions of lsquoverbrsquo and lsquoobjectrsquo We also note that besides ac-tions people talk about physical objects events and scenesndash descriptions of which are also present in screenplays andbooks Hence the same principle used here could be appliedto mine videos for more general visual contentAcknowledgments Arsha is supported by a Google PhDFellowship We are grateful to Carl Vondrick for early dis-cussions

References[1] Relja Arandjelovic and Andrew Zisserman Look listen and

learn In Proceedings of the IEEE International Conferenceon Computer Vision pages 609ndash617 2017 2

[2] Piotr Bojanowski Francis Bach Ivan Laptev Jean PonceCordelia Schmid and Josef Sivic Finding actors and ac-tions in movies In Proceedings of the IEEE internationalconference on computer vision pages 2280ndash2287 2013 2

[3] Fabian Caba Heilbron Victor Escorcia Bernard Ghanemand Juan Carlos Niebles Activitynet A large-scale videobenchmark for human activity understanding In Proceed-ings of the IEEE Conference on Computer Vision and PatternRecognition pages 961ndash970 2015 8

[4] Joao Carreira and Andrew Zisserman Quo vadis actionrecognition a new model and the Kinetics dataset In pro-ceedings of the IEEE Conference on Computer Vision andPattern Recognition pages 6299ndash6308 2017 2 6

[5] Luciano Del Corro Rainer Gemulla and Gerhard WeikumWerdy Recognition and disambiguation of verbs and verbphrases with syntactic and semantic pruning 2014 12

[6] Timothee Cour Benjamin Sapp Chris Jordan and BenTaskar Learning from ambiguously labeled images In 2009IEEE Conference on Computer Vision and Pattern Recogni-tion pages 919ndash926 IEEE 2009 2

[7] Jia Deng Wei Dong Richard Socher Li-Jia Li Kai Liand Li Fei-Fei Imagenet A large-scale hierarchical imagedatabase In 2009 IEEE conference on computer vision andpattern recognition pages 248ndash255 Ieee 2009 7

[8] Jacob Devlin Ming-Wei Chang Kenton Lee and KristinaToutanova Bert Pre-training of deep bidirectionaltransformers for language understanding arXiv preprintarXiv181004805 2018 3

[9] Olivier Duchenne Ivan Laptev Josef Sivic Francis Bachand Jean Ponce Automatic annotation of human actions invideo In 2009 IEEE 12th International Conference on Com-puter Vision pages 1491ndash1498 IEEE 2009 2

[10] Mark Everingham Josef Sivic and Andrew ZissermanldquoHello My name is Buffyrdquo ndash automatic naming of charac-ters in TV video In BMVC 2006 2

[11] Christoph Feichtenhofer Haoqi Fan Jitendra Malik andKaiming He Slowfast networks for video recognition InProceedings of the IEEE International Conference on Com-puter Vision pages 6202ndash6211 2019 2

[12] David F Fouhey Wei-cheng Kuo Alexei A Efros and Ji-tendra Malik From lifestyle vlogs to everyday interactionsIn Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition pages 4991ndash5000 2018 1 2 8

[13] Deepti Ghadiyaram Du Tran and Dhruv Mahajan Large-scale weakly-supervised pre-training for video action recog-nition In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition pages 12046ndash12055 20192

[14] Rohit Girdhar Du Tran Lorenzo Torresani and Deva Ra-manan Distinit Learning video representations without asingle labeled video ICCV 2019 7 14

[15] Chunhui Gu Chen Sun David A Ross Carl Von-drick Caroline Pantofaru Yeqing Li Sudheendra Vijaya-narasimhan George Toderici Susanna Ricco Rahul Suk-thankar Cordelia Schmid and Jitendra Malik AVA A video

dataset of spatio-temporally localized atomic visual actionsIn Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition pages 6047ndash6056 2018 1 2 5 67 8

[16] Tengda Han Weidi Xie and Andrew Zisserman Video rep-resentation learning by dense predictive coding In Proceed-ings of the IEEE International Conference on Computer Vi-sion Workshops 2019 7 14

[17] Haroon Idrees Amir R Zamir Yu-Gang Jiang Alex Gor-ban Ivan Laptev Rahul Sukthankar and Mubarak Shah TheTHUMOS challenge on action recognition for videos in thewild Computer Vision and Image Understanding 1551ndash232017 7

[18] Oana Ignat Laura Burdick Jia Deng and Rada MihalceaIdentifying visible actions in lifestyle vlogs arXiv preprintarXiv190604236 2019 1 2

[19] Longlong Jing and Yingli Tian Self-supervised spatiotem-poral feature learning by video geometric transformationsarXiv preprint arXiv181111387 2018 7 14

[20] W Kay J Carreira K Simonyan B Zhang C Hillier SVijayanarasimhan F Viola T Green T Back P Natsev MSuleyman and A Zisserman The Kinetics human actionvideo dataset CoRR abs170506950 2017 1 2 5 7

[21] Bruno Korbar Du Tran and Lorenzo Torresani Cooperativelearning of audio and video models from self-supervised syn-chronization In Advances in Neural Information ProcessingSystems pages 7763ndash7774 2018 2 7 14

[22] Hildegard Kuehne Hueihan Jhuang Estıbaliz GarroteTomaso Poggio and Thomas Serre Hmdb a large videodatabase for human motion recognition In 2011 Interna-tional Conference on Computer Vision pages 2556ndash2563IEEE 2011 6 7

[23] Ivan Laptev Marcin Marszałek Cordelia Schmid and Ben-jamin Rozenfeld Learning realistic human actions frommovies In IEEE Conference on Computer Vision amp PatternRecognition 2008 2

[24] Hsin-Ying Lee Jia-Bin Huang Maneesh Singh and Ming-Hsuan Yang Unsupervised representation learning by sort-ing sequences In Proceedings of the IEEE InternationalConference on Computer Vision pages 667ndash676 2017 714

[25] Edward Loper and Steven Bird Nltk the natural languagetoolkit arXiv preprint cs0205028 2002 3

[26] Marcin Marszałek Ivan Laptev and Cordelia Schmid Ac-tions in context In CVPR 2009-IEEE Conference on Com-puter Vision amp Pattern Recognition pages 2929ndash2936 IEEEComputer Society 2009 2

[27] Antoine Miech Jean-Baptiste Alayrac Piotr BojanowskiIvan Laptev and Josef Sivic Learning from video and textvia large-scale discriminative clustering In Proceedings ofthe IEEE international conference on computer vision 20172

[28] Antoine Miech Dimitri Zhukov Jean-Baptiste AlayracMakarand Tapaswi Ivan Laptev and Josef SivicHowTo100M Learning a Text-Video Embedding byWatching Hundred Million Narrated Video Clips In Pro-ceedings of the IEEE international conference on computervision 2019 1 2

[29] Ishan Misra C Lawrence Zitnick and Martial Hebert Shuf-

fle and learn unsupervised learning using temporal orderverification In European Conference on Computer Visionpages 527ndash544 Springer 2016 7 14

[30] Mathew Monfort Alex Andonian Bolei Zhou Kandan Ra-makrishnan Sarah Adel Bargal Yan Yan Lisa BrownQuanfu Fan Dan Gutfreund Carl Vondrick et al Momentsin time dataset one million videos for event understandingIEEE transactions on pattern analysis and machine intelli-gence 2019 1 8

[31] Iftekhar Naim Abdullah Al Mamun Young Chol SongJiebo Luo Henry Kautz and Daniel Gildea Aligningmovies with scripts by exploiting temporal ordering con-straints In 2016 23rd International Conference on PatternRecognition (ICPR) pages 1786ndash1791 IEEE 2016 2

[32] Andrew Owens and Alexei A Efros Audio-visual sceneanalysis with self-supervised multisensory features In Pro-ceedings of the European Conference on Computer Vision(ECCV) pages 631ndash648 2018 2

[33] Andrew Owens Jiajun Wu Josh H McDermott William TFreeman and Antonio Torralba Ambient sound providessupervision for visual learning In European conference oncomputer vision pages 801ndash816 Springer 2016 2

[34] Christopher Riley The Hollywood standard the completeand authoritative guide to script format and style MichaelWiese Productions 2009 3 11

[35] Gunnar A Sigurdsson Gul Varol Xiaolong Wang AliFarhadi Ivan Laptev and Abhinav Gupta Hollywood inhomes Crowdsourcing data collection for activity under-standing In European Conference on Computer Visionpages 510ndash526 Springer 2016 8

[36] Josef Sivic Mark Everingham and Andrew Zisserman whoare you-learning person specific classifiers from video In2009 IEEE Conference on Computer Vision and PatternRecognition pages 1145ndash1152 IEEE 2009 2

[37] Khurram Soomro Amir Roshan Zamir and Mubarak ShahUcf101 A dataset of 101 human actions classes from videosin the wild arXiv preprint arXiv12120402 2012 6 7 1114

[38] Chen Sun Fabien Baradel Kevin Murphy and CordeliaSchmid Contrastive bidirectional transformer for temporalrepresentation learning arXiv preprint arXiv1906057432019 7 14

[39] Yansong Tang Dajun Ding Yongming Rao Yu ZhengDanyang Zhang Lili Zhao Jiwen Lu and Jie Zhou CoinA large-scale dataset for comprehensive instructional videoanalysis In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition pages 1207ndash12162019 1 2

[40] Makarand Tapaswi Martin Bauml and Rainer Stiefelhagenknock knock who is it probabilistic person identificationin tv-series In 2012 IEEE Conference on Computer Visionand Pattern Recognition pages 2658ndash2665 IEEE 2012 2

[41] Makarand Tapaswi Martin Bauml and Rainer StiefelhagenBook2movie Aligning video scenes with book chaptersIn The IEEE Conference on Computer Vision and PatternRecognition (CVPR) June 2015 2

[42] Du Tran Heng Wang Lorenzo Torresani Jamie Ray YannLeCun and Manohar Paluri A closer look at spatiotemporalconvolutions for action recognition In Proceedings of the

IEEE conference on Computer Vision and Pattern Recogni-tion pages 6450ndash6459 2018 2 7 14

[43] Jiangliu Wang Jianbo Jiao Linchao Bao Shengfeng HeYunhui Liu and Wei Liu Self-supervised spatio-temporalrepresentation learning for videos by predicting motion andappearance statistics In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition pages 4006ndash4015 2019 7 14

[44] Limin Wang Yuanjun Xiong Zhe Wang Yu Qiao DahuaLin Xiaoou Tang and Luc Van Gool Temporal segmentnetworks Towards good practices for deep action recogni-tion In ECCV 2016 2

[45] Xiaolong Wang Ross Girshick Abhinav Gupta and Kaim-ing He Non-local neural networks In CVPR 2018 2

[46] David R Winer and R Michael Young Automated screen-play annotation for extracting storytelling knowledge InThirteenth Artificial Intelligence and Interactive Digital En-tertainment Conference 2017 3 11

[47] Yonghui Wu Mike Schuster Zhifeng Chen Quoc V LeMohammad Norouzi Wolfgang Macherey Maxim KrikunYuan Cao Qin Gao Klaus Macherey et al Googlersquosneural machine translation system Bridging the gap be-tween human and machine translation arXiv preprintarXiv160908144 2016 3

[48] Saining Xie Chen Sun Jonathan Huang Zhuowen Tu andKevin Murphy Rethinking spatiotemporal feature learningSpeed-accuracy trade-offs in video classification In Pro-ceedings of the European Conference on Computer Vision(ECCV) pages 305ndash321 2018 2 6 7

[49] Dejing Xu Jun Xiao Zhou Zhao Jian Shao Di Xie andYueting Zhuang Self-supervised spatiotemporal learning viavideo clip order prediction In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition pages10334ndash10343 2019 7 14

[50] Hang Zhao Chuang Gan Andrew Rouditchenko Carl Von-drick Josh McDermott and Antonio Torralba The soundof pixels In Proceedings of the European Conference onComputer Vision (ECCV) pages 570ndash586 2018 2

[51] Hang Zhao Zhicheng Yan Heng Wang Lorenzo Torresaniand Antonio Torralba SLAC A sparsely labeled datasetfor action classification and localization arXiv preprintarXiv171209374 2017 2

[52] Luowei Zhou Chenliang Xu and Jason J Corso Towardsautomatic learning of procedures from web instructionalvideos In Thirty-Second AAAI Conference on Artificial In-telligence 2018 1 2

[53] Yukun Zhu Ryan Kiros Rich Zemel Ruslan SalakhutdinovRaquel Urtasun Antonio Torralba and Sanja Fidler Align-ing books and movies Towards story-like visual explana-tions by watching movies and reading books In Proceed-ings of the IEEE international conference on computer vi-sion pages 19ndash27 2015 2 3

We include additional details and results for trainingthe Speech2Action model in Sec A In Sec B weshow more results for the techniques used to mine train-ing samples ndash ie the Keyword Spotting Baseline and theSpeech2Action model Finally we show results on theUCF101 [37] dataset in Sec C

A Speech2Action modelA1 Screenplay Parsing

We follow the grammar created by Winer et al [46]which is based on lsquoThe Hollywood Standardrsquo [34] anauthoritative guide to screenplay writing to parse thescreenplays and separate out various script elements Thetool uses spacing indentation capitalization and punctua-tion to parse screenplays into the following four differentelements1 Shot Headings ndash These are present at the start of eachscene or shot and may give general information about ascenes location type of shot subject of shot or time ofday eg INT CENTRAL PARK - DAY2 Stage Direction ndash This is the stage direction that is to begiven to the actors This contains the action informationthat we are interested in and is typically a paragraphcontaining many sentences eg Nason and his guysfight the fire They are CHOKING onsmoke PAN TO Ensign Menendez leadingin a fresh contingent of men to jointhe fight One of them is TITO3 Dialogue ndash speech uttered by each character eg INDYGet down4 Transitions ndash may appear at the end of a scene andindicate how one scene links to the next eg HARD CUTTO

In this work we only extract 2 Stage Direction and 3Dialogue After mining for verbs in the stage directionswe then search for the nearest section of dialogue (eitherbefore or after) and assign each sentence in the dialoguewith the verb class label (see Fig 7 for examples of verb-speech pairs obtained from screenplays)

A2 PR Curves on the Validation Set of the IMSDbData

We show precision-recall curves on the val set of theIMSDb dataset in Fig 6 Note how classes such as lsquorunrsquoand lsquophonersquo have a much higher recall for the same level ofprecision

We select thresholds for the Speech2Action modelusing a greedy search as follows (1) We allocate the re-trieved samples into discrete precision buckets (30-4040-50 etc) using thresholds obtained from the PRcurve mentioned above (2) For different actions we ad-

Figure 6 PR curves on the validation set of the IMSDb dataset forthe Speech2Action model Since the validation set is noisywe are only interested in performance in the low recall high pre-cision setting Note how some classes ndash lsquophonersquo lsquoopenrsquo and lsquorunrsquoperform much better than others

just the buckets to make sure the number of training ex-amples are roughly balanced for all classes (3) For classeswith low precision in order to avoid picking uncertain andhence noiser predictions we only select examples that hada precision above 30+

The number of retrieved samples per class can be seenin Fig 8 The number of retrieved samples for lsquophonersquoand lsquoopenrsquo at a precision value of 30 are in the millions(2272906 and 31657295 respectively) which is why wemanually increase the threshold in order to prevent a largeclass-imbalance during training We reiterate here onceagain that this evaluation is performed purely on the basisof the proximity of speech to verb class in the stage direc-tion of the movie screenplay (Fig 7) and hence it is not aperfect ground truth indication of whether an action will ac-tually be performed in a video (which is impossible to sayonly from the movie scripts) We use the stage directionsin this case as pseudo ground truth There are many casesin the movie screenplays where verb and speech pairs couldbe completely uncorrelated (see Fig 7 bottomndashright for anexample)

B Mining TechniquesB1 Keyword Spotting Baseline

In this section we provide more details about the Key-word Spotting Baseline (described in Sec 422 of the mainpaper) The total number of clips mined using the KeywordSpotting Baseline is 679049 We mine all the instancesof speech containing the verb class and if there are more

Figure 7 Examples of speech and verb action pairs obtain from screenplays In the bottom row (right) we show a possibly negative speechand verb pair ie the speech segment Thatrsquos not fair is assigned the action verb lsquorunrsquo whereas it is not clear that these two are correlated

why didnrsquot you return my phone calls they were both undone by true loversquos kissyou each get one phone call good girls donrsquot kiss and tell

phone i already got your phone line set up kiss kiss my abut my phone died so just leave a message okay it was our first kiss

irsquom on the phone i mean when they say rdquoirsquoll call yourdquo thatrsquos the kiss of deathwersquore collecting cell phones surveillance tapes video we can find i had to kiss jaceshe went to the dance with Harry Land against a top notch britisher yoursquoll be eaten alivedo you wanna dance eat my dust boys

dance and the dance of the seven veils eat ate something earlierwhat if i pay for a dance i canrsquot eat i canrsquot sleepthe dance starts in an hour you must eat the sardines tomorrowjust dance i ate bad sushiare you drunk and you can add someone to an email chain at any pointmy dad would be drinking somewhere else shersquos got a point buddy

drink you didnrsquot drink the mold point the point is theyrsquore all having a great timeletrsquos go out and drink didnrsquot advance very far i think is markrsquos pointsuper bowl is the super bowl of drinking you made your pointi donrsquot drink i watch my diet but no beside the point

Table 5 Examples of speech samples for six verb categories labelled with the keyword spotting baseline Each block shows theaction verb on the left and the speech samples on the right Since we do not need to use the movie screenplays for this baseline unlikeSpeech2Action (results in Table 2 of the main paper) we show examples of transcribed speech obtained directly from the unlabelledcorpus Note how the speech labelled with the verb lsquopointrsquo is indicative of a different semantic meaning to the physical action of lsquopointingrsquo

than 40K samples we randomly sample 40K clips The rea-son we cap samples at 40K is to prevent overly unbalancedclasses Examples of speech labelled with this baseline for6 verb classes can be seen in Table 5 There are two waysin which our learned Speech2Action model is theoreti-cally superior to this approach(1) Many times the speech correlated with a particular ac-tion does not actually contain the action verb itself eglsquoLook over therersquo for the class lsquopointrsquo

(2) There is no word-sense disambiguation in the way thespeech segments are mined ie lsquoLook at where I am point-ingrsquo vs lsquoYoursquove missed the pointrsquo Word-sense disambigua-tion is the task of identifying which sense of a word is usedin a sentence when a word has multiple meanings This tasktends to be more difficult with verbs than nouns becauseverbs have more senses on average than nouns and may bepart of a multiword phrase [5]

Figure 8 Distribution of training clips mined usingSpeech2Action We show the distribution for all 18 verbclasses It is difficult to mine clips for the actions lsquohugrsquo andlsquokickrsquo as these are often confused with lsquokissrsquo and lsquohitrsquo

Figure 9 Distribution of training clips mined using the Key-word Spotting baseline We show the distribution for all 18verb classes We cut off sampling at 40K samples for twelveclasses in order to prevent too much of a class imbalance

they just hung up pick up next message

you afraid of driving fast

i always drive the car on saturday never drive on monday

babe the speed limit is 120

because if you are just drive

just roll down the windows and dont make any stops

you want to learn that new dance thats

sweeping bostontrue but i choose to dance every time

okay what kind of dance shall we do

you want a germandance

why dont you come dance

go ahead go ahead and shoot now drop your weapon Do it drop your weapon drop your weapon

hands on the ground use the pistol

next caller call me back please

drop the gun

Can you please connect me to the tip line

but the number 2 car is rapidly hunting down

the number 3

you dance to get attention

PHONE

DRIVE

DANCE

SHOOT

Figure 10 Examples of clips mined automatically using the Speech2Action model applied to speech alone for 4 AVA classes Weshow only a single frame from each video Note the diversity in object for the category lsquo[answer] phonersquo (first row from left to right) alandline a cell phone a text message on a cell phone a radio headset a carphone and a payphone in viewpoint for the category lsquodriversquo(second row) including behind the wheel from the passenger seat and from outside the car and in background for the category lsquodancersquo(third row from left to right) inside a home on a football pitch in a tent outdoors in a clubparty and at an Indian weddingparty

B2 Mined Examples

The distribution of mined examples per class for all 18classes using the Speech2Action model and the Key-word Spotting baseline can be seen in Figures 8 and 9We note that it is very difficult to mine examples for ac-tions lsquohugrsquo and lsquokickrsquo as these are often accompanied withspeech similar to that accompanying lsquokissrsquo and lsquohitrsquo

We show more examples of automatically mined videoclips from unlabelled movies using the Speech2Actionmodel in Fig 10 Here we highlight in particular the di-versity of video clips that are mined using simply speechalone including diversity in objects viewpoints and back-ground scenes

C Results on UCF101In this section we show the results of pretraining on our

mined video examples and then finetuning on the UCF101dataset [37] following the exact same procedure describedin Sec 51 of the main paper UCF101 [37] is a dataset of13K videos downloaded from YouTube spanning over 101human action classes Our results follow a similar trendto those on HMDB51 pretraining on samples mined us-ing Speech2Action (814) outperforms training fromscratch (742) and pretraining on samples obtained usingthe keyword spotting basline (774) We note here how-ever that it is much harder to tease out the difference be-tween various styles of pretraining on this dataset becauseit is more saturated than HMDB51 (training from scratchalready yields a high accuracy of 742 and pretraining onKinetics largely solves the task with an accuracy of 957)






Table 6 Comparison with previous pre-training strategies foraction classification on UCF101 Training on videos labelledwith Speech2Action leads to a 7 improvement over trainingfrom scratch and outperforms previous self-supervised works Italso performs competitively with other weakly supervised worksKSB-mined video clips mined using the keyword spotting base-line S2A-mined video clips mined using the Speech2Actionmodel daggervideos without labels videos with labels distilledfrom ImageNet When comparing to [21] we report the numberachieved by their I3D (RGB only) model which is the closest toour architecture For 983183 we report the reimplementations by [38]using the S3D-G model (same as ours) For the rest we reportperformance directly from the original papers

the action is While it appears that in some cases the speechis correlated with action ndash lsquoRaise your glasses to rsquo in themore general domain of movies and TV shows it is morelikely that the speech is completely uncorrelated with theaction ndash lsquoHow is your day goingrsquo Hence in this workwe explicitly learn to identify when the speech is discrimi-native While the supervision we obtain from the speechndashaction correlation is still noisy we show that at scale it canprovide sufficient weak supervision to train visual classi-fiers (see Fig 1)

Luckily we have a large amount of literary content atour disposal to learn this correlation between speech andactions Screenplays can be found for hundreds of moviesand TV shows and contain rich descriptions of the identi-ties of people their actions and interactions with one an-other and their dialogue Early work has attempted to alignthese screenplays to the videos themselves and use thatas a source of weak supervision [2 9 23 26] Howeverthis is challenging due to the lack of explicit correspon-dence between scene elements in video and their textualdescriptions in screenplays [2] and notwithstanding align-ment quality is also fundamentally limited in scale to theamount of aligned movie screenplays available Instead welearn from unaligned movie screenplays We first learn thecorrelation between speech and actions from written mate-rial alone and use this to train a Speech2Action classi-fier This classifier is then applied to the speech in an un-labelled unaligned set of videos to obtain visual samplescorresponding to the actions confidently predicted from thespeech (Fig 1) In this manner the correlations can provideus with an effectively infinite source of weak training datasince the audio is freely available with movies

Concretely we make the following four contributions(i) We train a Speech2Action model from literaryscreenplays and show that it is possible to predict certainactions from transcribed speech alone without the need forany manual labelling (ii) We apply the Speech2Actionmodel to a large unlabelled corpus of videos to obtainweak labels for video clips from the speech alone (iii) Wedemonstrate that an action classifier trained with these weaklabels achieves state of the art results for action classifica-tion when fine-tuned on standard benchmarks compared toother weakly superviseddomain transfer methods (iv) Fi-nally and more interestingly we evaluate the action clas-sifier trained only on these weak labels with no fine-tuningon the mid and tail classes from the AVA dataset [15] inthe zero-shot and few-shot setting and show a large boostover fully supervised performance for some classes withoutusing a single manually labelled example

2 Related WorksAligning Screenplays to Movies A number of works haveexplored the use of screenplays to learn and automatically

annotate character identity in TV series [6 10 31 36 40]Learning human actions from screenplays has also been at-tempted [2 9 23 26 27] Crucially however all theseworks rely on aligning these screenplays to the actual videosthemselves often using the speech (as subtitles) to providecorrespondences However as noted by [2] obtaining su-pervision for actions in this manner is challenging due tothe lack of explicit correspondence between scene elementsin video and their textual descriptions in screenplays

Apart from the imprecise temporal localization inferredfrom subtitles correspondences a major limitation is thatthis method is not scalable to all movies and TV showssince screenplays with stage directions are simply notavailable at the same order of magnitude Hence previousworks have been limited to a small scale no more than tensof movies or a season of a TV series [2 9 23 26 27] Asimilar argument can be applied to works that align booksto movies [41 53] In contrast we propose a method thatcan exploit the richness of information in a modest numberof screenplays and then be applied to a virtually limitlessset of edited video material with no alignment or manualannotation required

Supervision for Action Recognition The benefits oflearning from large scale supervised video datasets for thetask of action recognition are well known with the intro-duction of datasets like Kinetics [20] spurring the develop-ment of new network architectures yielding impressive per-formance gains eg [4 11 42 44 45 48] However asdescribed in the introduction such datasets come with anexorbitant labelling cost Some work has attempted to re-duce this labeling effort through heuristics [51] (although ahuman annotator is required to clean up the final labels) orby procuring weak labels in the form of accompanying metadata such as hashtags [13]

There has also been a recent growing interest in usingcross-modal supervision from the audio streams readilyavailable with videos [1 21 32 33 50] Such methodshowever focus on non-speech audio eg lsquoguitar playingrsquothe lsquothudrsquo of a bouncing ball or the lsquocrashrsquo of waves atthe seaside rather the transcribed speech As discussed inthe introduction transcribed speech is used only in certainnarrow domains eg instruction videos [28 39 52] andlifestyle vlogs [12 18] while in contrast to these works wefocus on the domain of movies and TV shows (where thelink between speech and actions is less explicit) Furthersuch methods use most or all the speech accompanying avideo to learn a better overall visual embedding whereaswe note that often the speech is completely uninformativeof the action Hence we first learn the correlation betweenspeech and actions from written material and then applythis knowledge to an unlabelled set of videos to obtainvideo clips that can be used directly for training











google-researchbert












41 Unlabelled Data











42 68 18 41 27 51 83 52 18 27








you dance yank












donrsquot hit

DANCE

HIT





EAT

POINT

DRINK

RUN




PHONE

KISS











53 Results



















COUNT




































































































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples





















google-researchbert












41 Unlabelled Data











42 68 18 41 27 51 83 52 18 27








you dance yank












donrsquot hit

DANCE

HIT





EAT

POINT

DRINK

RUN




PHONE

KISS











53 Results



















COUNT




































































































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples






















41 Unlabelled Data











42 68 18 41 27 51 83 52 18 27








you dance yank












donrsquot hit

DANCE

HIT





EAT

POINT

DRINK

RUN




PHONE

KISS











53 Results



















COUNT




































































































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples




















42 68 18 41 27 51 83 52 18 27








you dance yank












donrsquot hit

DANCE

HIT





EAT

POINT

DRINK

RUN




PHONE

KISS











53 Results



















COUNT




































































































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples














you dance yank












donrsquot hit

DANCE

HIT





EAT

POINT

DRINK

RUN




PHONE

KISS











53 Results



















COUNT




































































































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples













53 Results



















COUNT




































































































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples





















COUNT




































































































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples









































































































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples










































































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples
















































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples




































drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples



























drop the gun



the number 3


PHONE

DRIVE

DANCE

SHOOT


B2 Mined Examples



















Date post:	22-Sep-2020
Category:	Documents
Upload:	others
View:	0 times
Download:	0 times

Speech2Action: Cross-modal Supervision for Action...

Documents