
Leveraging the Present to Anticipate the Future in Videos

Antoine Miech 1,2   Ivan Laptev 1,2   Josef Sivic 1,2,3   Heng Wang 4   Lorenzo Torresani 4   Du Tran 4

1 Inria   2 École Normale Supérieure   3 CIIRC   4 Facebook AI

Abstract

Anticipating actions before they are executed is crucial for a wide range of practical applications including autonomous driving and robotics. While most prior work in this area requires partial observation of executed actions, in this paper we focus on anticipating actions seconds before they start. Our proposed approach is the fusion of a purely anticipatory model with a complementary model constrained to reason about the present. In particular, the latter predicts present action and scene attributes, and reasons about how they evolve over time. By doing so, we aim at modeling action anticipation at a more conceptual level than directly predicting future actions. Our model outperforms previously reported methods on the EPIC-KITCHENS and Breakfast datasets.

1. Introduction

Automatic video understanding has improved significantly over the last few years. Such advances have manifested in disparate video understanding tasks, including action recognition [5, 8, 11, 38, 41], temporal action localization [37, 39, 47, 53], video search [13], video summarization [32] and video categorization [29]. In this work, we focus on the problem of anticipating future actions in videos as illustrated in Figure 1.

A significant amount of prior work [5, 8, 11, 21, 23, 38, 41, 42, 46] in automatic video understanding has focused on the task of action recognition. The goal of action recognition is to recognize what action is being performed in a given video. While accurate recognition is crucial for a wide range of practical applications such as video categorization or automatic video filtering, certain settings do not allow for complete or even partial observation of an action before it happens. For instance, an autonomous car should be able to recognize the intent of a pedestrian to cross the road well before the action is actually initiated in order to avoid an accident. In practical applications where we seek to act before an action gets executed, being able to anticipate the future given the present is critical.

Anticipating the future, especially long-term, is a challenging task because the future is not deterministic: several outcomes are possible given the current observation. To reduce uncertainty, most work in this field [2, 15, 18, 25, 35, 36] requires partially observed execution of actions. In this paper, we address the task of action anticipation even when no partial observation of the action is available. While prior work [7, 20, 27, 43] has addressed this same task, in this work, we specifically focus on better leveraging recognition models to improve future action prediction. We propose a fusion of two approaches: one directly anticipates the future while the other first recognizes the present and then anticipates the future, given the present. We have empirically observed the complementarity of these two approaches when evaluating on three distinct and diverse benchmarks: EPIC-KITCHENS [6], Breakfast [19] and ActivityNet [4].

Figure 1: Action anticipation. Examples of action anticipation in which the goal is to anticipate future actions in videos seconds before they are performed.

1.1. Contributions

The contributions of our work are: (i) We propose a new framework for the task of anticipating human actions several seconds before they are performed. Our model is decomposed into two complementary models. The first, named the predictive model, anticipates the action directly from the visual inputs. The second one, the transitional model, is first constrained to predict what is happening in the observed time interval and then leverages this prediction to anticipate the future actions. (ii) We present extensive experiments on three datasets with state-of-the-art results on the EPIC-KITCHENS [6] and Breakfast action dataset [19]. In addition, our model provides ways to explain its outputs, which allows us to easily interpret our model as we demonstrate in our qualitative analysis.

2. Related work

Predicting the future is a broad research area. Our work touches on future frame, motion and semantic mask prediction as well as human trajectory and action prediction, which we review below.

Future pixel, motion or semantic mask prediction. Future frame prediction has recently attracted many research efforts. Mathieu et al. [28] predict future frames of a video by proposing a multi-scale network architecture. They train it using an adversarial approach to minimize an image gradient difference loss. Vondrick et al. [44] also generate future video frames using a transformation of pixels from the past. Xue et al. [51] instead propose a probabilistic model to generate future frames from a single image. Oh et al. [30] predict action dependent future frames in old-school Atari video games. Finn et al. [10] explore video prediction for real-world interactive robot agents. Instead of directly predicting future pixels, Luc et al. [24] aim at predicting future semantic segmentation masks in videos, Walker et al. [45] explore future pose prediction in videos and Pintea et al. [31] predict motion from single still images. Our work differs from them as we predict future action labels instead of predicting pixel-level information.

Human trajectory prediction. Predicting human trajectories has also received wide attention [1, 26, 17, 50]. Kitani et al. [17] approach this task by casting it as an inverse reinforcement learning (IRL) problem. Alahi et al. [1] model prediction of human trajectories as a sequence generation task and propose to generate these trajectories using recurrent neural networks (LSTMs). These works differ from ours as they predict an entire sequence of future locations instead of a single action.

Early-stage action recognition. Our work is related to the field of action anticipation and early-stage action prediction. A large body of work [2, 15, 18, 25, 35, 36] focuses on predicting actions given partially observed executions. This setting differs from action recognition [5, 8, 11, 21, 23, 38, 41, 42, 46] or temporal action detection [37, 39, 47, 53], as it assumes access to a small fraction (the beginning) of an action. One early work in this genre is from Ryoo [35], who uses dynamic bag-of-words to efficiently model the feature distribution change over time. Hoai et al. [15] and Ma et al. [25] formulate this task as a ranking problem where a monotonically non-decreasing prediction score is enforced as visual evidence is accumulated. Similarly to Vondrick et al. [43], Shi et al. [36] aim first at regressing future visual feature vectors. These feature vectors are then used as input to an action recognition model for early-stage action prediction. Our work differs from early-stage action prediction, as we aim at predicting an action even before it has actually started.

Action anticipation. Prior efforts [7, 17, 20, 27, 43] have addressed the task of anticipating actions before they are executed. The work of Farha et al. [7] aims at predicting not only one but a sequence of future actions. However, their experiments concern a restricted setup with a strong “action grammar” specific to cooking videos with predefined recipes [19]. Our work is not restricted to this type of dataset since we also experiment on an unscripted cooking video dataset (EPIC-KITCHENS [6]) and a non-cooking video dataset (ActivityNet 200 [4]). Also, their action anticipation approach can only be applied to videos with annotated sequences of actions, whereas our method can be applied to any type of video dataset. The system of Mahmud et al. [27] is similarly trained to predict the start of the next action. Anticipating events before they occur is also used to predict traffic accidents [16, 40, 52]. Prior work also applied action anticipation in the domain of sports analytics such as basketball [3, 9], water polo [9], tennis and soccer [48]. Such models aim to anticipate future trajectories of a ball and individual players. IRL has also been recently applied to activity forecasting from first-person egocentric daily activity videos [33]. On the other hand, Wu et al. [49] combine an on-wrist motion accelerometer and a camera to perform daily intention anticipation. Note that several of these systems [1, 2, 7, 25, 27, 33, 36, 40, 49] employ recurrent neural networks (RNNs) to address the sequential nature of these predictive tasks.

3. Action Anticipation Model

Our goal is to anticipate an action T seconds before it starts. More formally, let V denote a video. Then we indicate with Va:b the segment of V starting at time a and ending at time b, and with yc the label of the action that starts at time c. We would like to find a function f such that f(V0:t) predicts yt+T. The main idea behind our model is that we decompose f as a weighted average of two functions, a predictive model fpred and a transitional model ftrans:

f = α fpred + (1 − α) ftrans,   α ∈ [0, 1],    (1)

where α is a dataset-dependent hyper-parameter chosen by validation. The first function fpred is trained to predict the future action directly from the observed segment. On the other hand, ftrans is first constrained to compute high-level properties of the observed segment (e.g., attributes or the action performed in the present). Then, in a second stage, ftrans uses this information to anticipate the future action. In the next subsections we explain how to learn fpred and ftrans. Figure 2 presents an overview of the proposed model.

Figure 2: Overview of our approach. Our task is to predict an action T seconds before it starts to be performed. Our model is a combination of two complementary modules: the predictive model and the transitional model. While the predictive model directly anticipates the future action, the transitional model is first constrained to output what is currently happening. Then, it uses this information to anticipate future actions.
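To make the decomposition in Eq. (1) concrete, here is a minimal Python sketch, assuming both models have already been reduced to per-class probability vectors and that α was selected on a validation set; the function and variable names are illustrative, not taken from the authors' implementation:

```python
import numpy as np

def fuse_predictions(p_pred: np.ndarray, p_trans: np.ndarray, alpha: float) -> np.ndarray:
    """Eq. (1): weighted average of the predictive and transitional scores.

    p_pred, p_trans: arrays of shape (K,) with class probabilities for the K
    action classes, produced by f_pred and f_trans respectively.
    alpha: dataset-dependent weight in [0, 1], chosen by validation.
    """
    assert 0.0 <= alpha <= 1.0
    return alpha * p_pred + (1.0 - alpha) * p_trans

# Hypothetical example: anticipate the most likely future action from the fused scores.
p_pred = np.array([0.1, 0.7, 0.2])    # assumed f_pred output
p_trans = np.array([0.3, 0.4, 0.3])   # assumed f_trans output
fused = fuse_predictions(p_pred, p_trans, alpha=0.5)
predicted_class = int(np.argmax(fused))
```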

3.1. Predictive model fpred

The goal of the predictive model fpred is to directly anticipate the future action from the visual input. As opposed to ftrans, fpred is not subject to any specific constraint. Suppose that we are provided with a training video V with action labels yt0+T, . . . , ytn+T. For each label yti+T, we want to minimize the loss:

l(fpred(Vs(ti):ti), yti+T ), (2)

where s(ti) = max(0, ti − tpred), l is the cross-entropy loss, and tpred is a dataset-dependent hyper-parameter, also chosen by validation, that represents the maximum temporal interval of the video fpred has access to. This hyper-parameter is essential because looking too far into the past may add irrelevant information that degrades prediction performance. This loss is then summed over all videos from the training dataset. In this work, fpred is a linear model which takes as input a video descriptor which we describe in Section 4.2.
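As a rough illustration of Eq. (2), the sketch below clips the observation window to s(ti) = max(0, ti − tpred), pools the clip features of the observed segment, and scores a linear classifier with the cross-entropy loss; the feature layout (one descriptor per second), the PyTorch framing, and all dimensions are our assumptions rather than details from the paper:

```python
import torch
import torch.nn as nn

def predictive_loss(video_features, t_i, t_pred, future_label, f_pred, fps=1.0):
    """Eq. (2): cross-entropy between f_pred applied to the observed window
    V_{s(ti):ti}, with s(ti) = max(0, ti - t_pred), and the future action label.

    video_features: (T, D) tensor with one clip descriptor per 1/fps seconds (assumed layout).
    """
    start = max(0, int((t_i - t_pred) * fps))
    end = max(start + 1, int(t_i * fps))
    pooled, _ = video_features[start:end].max(dim=0)    # temporal max pooling over the window
    logits = f_pred(pooled.unsqueeze(0))                # (1, K) scores for the future action
    return nn.functional.cross_entropy(logits, torch.tensor([future_label]))

# f_pred is a simple linear model on the pooled video descriptor, as in the paper;
# D and K below are hypothetical feature and class counts.
D, K = 512, 125
f_pred = nn.Linear(D, K)
loss = predictive_loss(torch.randn(30, D), t_i=20.0, t_pred=6.0, future_label=3, f_pred=f_pred)
```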

Figure 3: Illustration of our transitional models. Upper: our Action Recognition (AR) based transitional model learns to predict future actions based on the predictions of an action recognition classifier applied to current/present frames (clips). Lower: our Visual Attributes (VA) based transitional model learns to predict future actions based on visual attributes of the current/present frames (clips).

3.2. Transitional model ftrans

The transitional model ftrans splits the prediction into two stages: gs and gt. The first stage gs aims at recognizing a current state s describing the observed video segment. The state s can represent an action or a latent action-attribute. The second stage gt takes as input the current state s and anticipates the next action given that state. gs can be thought of as a complex function extracting high-level information from the observed video segment, while gt is a simple (in fact, linear) function operating on the state s and modeling the correlation between the present state and the future action. We will next explain in detail how we define the current state s and how we model the transition function gt. We propose two different approaches for our transitional model: one that is based on action recognition and one that relies on visual attributes, as illustrated in Figure 3.

Transitional Model based on Visual Attributes. In this approach, we leverage visual attributes [23] to anticipate the future. Visual attributes have been previously used for action recognition by Liu et al. [23]. The idea is to first predefine a set of visual attributes describing the presence or absence of objects, scenes or atomic actions in a video. Then, a model is trained on these visual attributes for action recognition. In this work, we instead use visual attributes as a means to express the transitional model. The current state s ∈ [0, 1]^a predicted by gs is then a vector of visual attribute probabilities, where a is the number of visual attributes. Given the presently observed visual attributes s, gt predicts the future action. We model gt as a low-rank linear model:

gt(s) = W2 (W1 s + b1) + b2,    (3)

where W1 ∈ R^(r×a), W2 ∈ R^(K×r), b1 ∈ R^r, b2 ∈ R^K, K ∈ N is the number of action classes and r is the rank of gt. These parameters are learned, in the same manner as the predictive model, by minimizing the cross-entropy loss between the predicted action given by gt(s) and the future action ground-truth. Implementing gt through a low-rank model reduces the number of parameters to estimate. Empirically, we found that this leads to better accuracy, as shown in our experiments. The lower part of Figure 3 illustrates this case.
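A possible PyTorch rendering of the low-rank model of Eq. (3) is sketched below; the number of classes is a placeholder, the rank r = 256 is one of the values explored later in Table 3, and the module name is ours:

```python
import torch.nn as nn

class LowRankTransition(nn.Module):
    """gt(s) = W2 (W1 s + b1) + b2, with W1 in R^(r x a) and W2 in R^(K x r)."""

    def __init__(self, num_attributes: int, num_classes: int, rank: int):
        super().__init__()
        self.proj = nn.Linear(num_attributes, rank)   # W1, b1
        self.cls = nn.Linear(rank, num_classes)       # W2, b2

    def forward(self, s):
        # s: (batch, a) vector of visual-attribute probabilities in [0, 1]
        return self.cls(self.proj(s))                 # logits over the K future action classes

# Example instantiation with the 1965 attributes used in the paper and an
# illustrative class count; trained with cross-entropy against future labels.
g_t = LowRankTransition(num_attributes=1965, num_classes=125, rank=256)
```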

Transitional Model based on Action Recognition. Real-world videos often consist of a sequence of elementary actions performed by a person in order to reach a final goal such as Preparing coffee, Changing car tire or Assembling a chair. Many datasets come with a training set where each video has been annotated with action labels and segment boundaries for all occurring actions (e.g., EPIC-KITCHENS, Breakfast). When this is available we can use action labels instead of predefined visual attributes for the state s. The intuition behind our claim is that the anticipation of the next action significantly depends on the action currently being performed. In other words, we make a Markov assumption on the sequence of performed actions. More formally, suppose we are provided with an ordered sequence of action annotations (a0, . . . , aN) ∈ {1, . . . , K}^N for a given video, where an defines the action class performed in video segment Vn. We propose to model P(an+1 = i | Vn) as follows:

P(an+1 = i | Vn) = ∑_{j=1..K} P(an+1 = i | an = j) P(an = j | Vn),    (4)

∀n ∈ {0, . . . , N − 1}, i ∈ {1, . . . , K}. This reformulation decomposes the computation of P(an+1 = i | Vn) in terms of two factors: 1) an action recognition model gs(Vn) that predicts P(an = j | Vn), i.e., the action being performed in the present; 2) a transition matrix T that captures the statistical correlation between the present and the future action, i.e., such that Tij ≈ P(an+1 = i | an = j). In this scenario, gt takes as input the probability scores of each action given by gs to anticipate the next action in a probabilistic manner:

gt(s) = T s,    (5)

P(an+1 = i) = ∑_{j=1..K} Ti,j sj = [gt(s)]i.    (6)

In practice, we compute T by estimating the conditional probabilities between present and future actions from the sequences of action annotations in the training set. The top part of Figure 3 illustrates this model.
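Under the Markov assumption above, T can be estimated by counting present-to-next transitions in the training annotations and normalizing each column, then applied as in Eqs. (5)-(6). The sketch below shows one such counting estimator with additive smoothing; the smoothing and the function names are our own choices, not specified in the paper:

```python
import numpy as np

def estimate_transition_matrix(action_sequences, num_classes, smoothing=1e-6):
    """T[i, j] ~ P(a_{n+1} = i | a_n = j), counted over the training sequences."""
    counts = np.full((num_classes, num_classes), smoothing)
    for seq in action_sequences:                  # each seq: ordered action labels of one video
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[nxt, prev] += 1.0
    return counts / counts.sum(axis=0, keepdims=True)   # each column j sums to 1

def transitional_ar(T, s):
    """Eqs. (5)-(6): gt(s) = T s, where s = P(a_n = . | V_n) from the recognizer."""
    return T @ s

# Toy example with K = 3 classes and two annotated videos.
T = estimate_transition_matrix([[0, 1, 2], [0, 2, 2]], num_classes=3)
s = np.array([0.6, 0.3, 0.1])     # hypothetical action recognition posteriors
p_next = transitional_ar(T, s)    # probability of each next action
```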

Prediction Explainability. The transitional model ftrans provides interpretable predictions that can be easily analyzed for explanation. Indeed, the function gt of the transitional model takes the form of a simple linear model applied to the state s, both when using visual attributes and when using action predictions. The linear weights of gt can be interpreted as conveying the importance of each element in s for the anticipation of the action class. For example, given an action class k to anticipate, we can analyze the linear weights of gt to understand which visual attributes or action classes are most responsible for the prediction of action class k.

It also provides an easy way to diagnose the source of mispredictions. For example, suppose the transitional model wrongly anticipates an action k and we seek to understand the reason behind this misprediction. Let v1, . . . , va ∈ [0, 1] be the vector encoding the visual attributes (or action recognition scores) for this wrong prediction. Let also wk,1, . . . , wk,a ∈ R be the learned linear weights associated with the prediction of action class k. The top factor for the prediction of action k is max_{i∈[1,a]} (wk,i vi). By analyzing this top factor, we can understand whether the misprediction is due to a recognition problem (i.e., a wrong detection score for the visual attribute/action class) or due to the learned transition weights.
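The diagnosis described above boils down to sorting the element-wise products wk,i vi. A small helper along these lines, with hypothetical attribute names used only for the example:

```python
import numpy as np

def top_factors(w_k, v, attribute_names, topn=3):
    """Return the attributes contributing most to the prediction of class k.

    w_k: learned linear weights of gt for action class k, shape (a,).
    v:   visual-attribute (or action recognition) scores, shape (a,), in [0, 1].
    """
    contrib = w_k * v                          # element-wise w_{k,i} * v_i
    order = np.argsort(contrib)[::-1][:topn]   # indices of the largest factors
    return [(attribute_names[i], float(contrib[i])) for i in order]

# Hypothetical example: which attributes drove a "Making lemonade" prediction?
names = ["Lemon", "Measure cup", "Teapot"]
print(top_factors(np.array([2.1, 1.7, -0.4]), np.array([0.9, 0.8, 0.1]), names))
```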

4. Experiments

In this section, we evaluate our approach on three datasets. We then provide an ablation study, compare our method with the state of the art and present a qualitative analysis of the transitional model.

4.1. Datasets

These datasets were picked because they are diverse and contain the accurately annotated action temporal segments necessary for the evaluation of action anticipation.

EPIC-KITCHENS. EPIC-KITCHENS [6] is a large-scale cooking video dataset containing 39,594 accurate temporal segment action annotations. Each video is composed of a sequence of temporal segment annotations. Three different tasks are proposed together with the dataset: object detection, action recognition and action anticipation. The action anticipation task is to predict an action one second before it has started. The dataset contains three different splits: the training set, the seen kitchens test set (S1) composed of videos from kitchens also appearing in the training set, and finally the unseen kitchens test set (S2) with kitchens that do not appear in the training set. A publicly available challenge is also organized to keep track of the best performing approach on this anticipation task. Because of this public challenge, the labels of the S1 and S2 test sets are not available. Thus, most of our results are reported on our validation set composed of the following kitchens: P03, P14, P23 and P30. We also report results evaluated by the challenge organizers on the held-out test set. Unless specified otherwise, for comparison purposes, we report experiments with T = 1 sec.

Model         Pretrain   Fine-tune     Action   Verb   Noun
ResNet-50     Imagenet   No            3.4      24.5   7.4
R(2+1)D-18    Kinetics   No            5.2      27.2   10.3
R(2+1)D-18    Kinetics   EK-Anticip.   5.0      24.6   9.7
R(2+1)D-18    Kinetics   EK-Recogn.    6.0      27.6   11.6

Table 1: Effects of pre-training. Action anticipation top-1 per-clip accuracy on EPIC-KITCHENS with different models and pre-training datasets.

Breakfast. The Breakfast action dataset [19] is an annotated cooking video dataset of people preparing breakfast meals. It comes with 11,267 temporal segment action annotations. Each video is also composed of a sequence of temporal action segment annotations. The dataset is partitioned into four different train / test splits: S1, S2, S3 and S4. We quantify performance with the average scores over all four splits. Unless specified differently, for comparison purposes, we report experiments with T = 1 sec.

ActivityNet 200. The ActivityNet 200 video dataset [4] contains 15,410 temporal action segment annotations in the training set and 7,654 annotations in the validation set. This video dataset is mainly used for evaluating action localization models, but as the videos come with accurate temporal segments for each action, we can also use them to evaluate models on action anticipation. As opposed to the EPIC-KITCHENS [6] and Breakfast [19] datasets, each video contains only one single action annotation instead of a sequence of action segments. For this reason, we cannot test the transitional model based on action recognition on ActivityNet. We only train and evaluate on videos in the dataset with at least 10 seconds of video before the action starts. In total, the training and validation sets consist of 9,985 and 4,948 action localization annotations, respectively.

4.2. Video Representation

In this subsection we discuss how we represent the observed video segment V to perform action prediction. Our overall strategy is to split the video into clips, extract clip representations and perform pooling over these clips. Given an input video segment V, we uniformly split it into small clips V = [V1, . . . , VN] where each clip Vi, i ∈ [1, N], is short enough (e.g., 8 or 16 frames) that it can be fed into a pretrained video CNN C. From the penultimate layer of the CNN we extract an L2-normalized one-dimensional representation C(Vi) for each clip Vi. Then we perform a temporal aggregation Agg([C(V1), . . . , C(VN)]) of the extracted features in order to get a one-dimensional video representation for V. In our experiments, C is the 18-layer R(2+1)D network from Tran et al. [41]. We perform a simple max pooling to aggregate features from all clips, but more sophisticated temporal aggregation techniques [29] can also be used in our model.

                                   Action        Verb          Noun
Model                              A@1   A@5     A@1   A@5     A@1   A@5
Transitional (VA)                  4.6   12.1    25.0  71.7    9.1   24.5
Transitional (AR)                  5.1   17.1    25.2  72.0    12.1  33.2
Predictive                         6.3   17.3    27.4  73.1    11.9  31.5
Predictive + Transitional (VA)     6.8   18.1    28.4  74.0    12.5  33.0
Predictive + Transitional (AR)     6.7   19.1    27.3  73.5    12.9  34.6
Transitional (AR with GT)          16.1  29.4    29.3  63.3    30.7  44.4
Action recognition                 12.1  30.0    39.3  80.0    23.1  49.3

Table 2: Transitional and predictive model ablation. Transitional model and predictive model ablation study on our EPIC-KITCHENS validation set with T = 1 sec. VA and AR denote Visual Attributes and Action Recognition. The grey rows (the last two) should be interpreted as accuracy upper bounds.
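A schematic version of this clip-splitting and pooling pipeline, assuming a generic pretrained clip encoder interface; the clip length, the non-overlapping stride and the encoder call signature are stand-ins rather than the exact R(2+1)D-18 setup used in the paper:

```python
import torch
import torch.nn.functional as F

def video_descriptor(frames, clip_encoder, clip_len=16):
    """Split a video into clips, encode each clip, L2-normalize, and max-pool.

    frames: tensor of shape (T, C, H, W) holding the observed segment.
    clip_encoder: callable mapping a (1, C, clip_len, H, W) clip to a (1, D) feature.
    Returns a single (D,) descriptor for the whole observed segment.
    """
    feats = []
    for start in range(0, frames.shape[0] - clip_len + 1, clip_len):
        clip = frames[start:start + clip_len]               # (clip_len, C, H, W)
        clip = clip.permute(1, 0, 2, 3).unsqueeze(0)        # (1, C, clip_len, H, W)
        f = clip_encoder(clip)                              # (1, D) penultimate-layer feature
        feats.append(F.normalize(f, dim=1))                 # L2 normalization
    return torch.cat(feats, dim=0).max(dim=0).values        # temporal max pooling -> (D,)
```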

Visual Attributes. Our visual attributes presented in Section 3.2 include the taxonomies of Imagenet-1000 [34], Kinetics-600 [5] and Places-365 [54]. We train two ResNet-50 [12] CNN models: one on Imagenet-1000 and the other one on Places-365. For the Kinetics-600 taxonomy, we train a R(2+1)D-18 [41] model. In total, our set is composed of 1965 (1000+600+365) visual attributes. We densely extract these visual attributes every 0.5 seconds and apply the temporal max pooling operation to obtain a single vector for each video, as discussed above.
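Under our assumption that each attribute classifier exposes per-time-step probabilities, the 1965-dimensional state s can be assembled by concatenating the three taxonomies and max-pooling over the sampled time steps, for example:

```python
import torch

def attribute_state(object_probs, scene_probs, action_probs):
    """Build the 1965-d visual-attribute state s for one observed segment.

    Each argument is a (num_steps, num_labels) tensor of probabilities sampled
    every 0.5 s: 1000 Imagenet objects, 365 Places scenes, 600 Kinetics actions.
    """
    per_step = torch.cat([object_probs, scene_probs, action_probs], dim=1)  # (num_steps, 1965)
    return per_step.max(dim=0).values                                       # temporal max pooling -> (1965,)
```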

Leveraging the present for pretraining. In previous work [6, 43] the video representation was learned by fine-tuning a pretrained video CNN on the task of action anticipation. Instead, we propose to finetune the CNN representation on the task of action recognition on the target dataset. More specifically, instead of training the CNN on video clips sampled before the action starts, we train it on clips sampled in the action segment interval. This is motivated by the fact that the task of action recognition is “easier” than action anticipation and thus it may lead to better feature learning. Table 1 reports accuracies on the EPIC-KITCHENS validation set obtained with our predictive model applied to different CNN representations. These results illustrate the benefit of fine-tuning the CNN on action recognition, instead of action anticipation as done in prior work [6, 43]. The table also provides numbers for two additional baselines corresponding to 1) using the CNN pretrained on Kinetics without finetuning and 2) extracting features from a ResNet-50 2D CNN pretrained on Imagenet. It can be noted that the best accuracies for actions, verbs and nouns are obtained with the CNN finetuned on the action recognition task of EPIC-KITCHENS. Based on these results, in the rest of the work, we use CNN features computed from a R(2+1)D-18 first pretrained on Kinetics [5] and then finetuned for action recognition on the target dataset.

Model                                                  Accuracy
Random baseline                                        0.3
Predictive                                             51.6
Transitional (All VA, Full rank)                       48.0
Transitional (Object & Scene VA, Low rank, r = 256)    37.0
Transitional (All VA, Low rank, r = 256)               52.8
Predictive + Transitional (VA, Low rank)               54.8

Table 3: Results on ActivityNet action anticipation. Our methods compared with the baseline on our validation set with T = 5 sec. VA stands for Visual Attributes.

4.3. Ablation study

In order to understand the benefits of the different components in our model, we evaluate the predictive model separately from the transitional model. For the transitional model we report results for both the variant based on Visual Attributes (VA) as well as the version based on Action Recognition (AR). Table 2 summarizes the results achieved on the validation set of EPIC-KITCHENS [6]. The AR transitional model performs better than the VA transitional model. However, both are outperformed by the purely predictive model. Interestingly, combining the predictive model with either of the two transitional models yields further accuracy gains. This suggests that the predictions are complementary.

We also show in grey an accuracy upper bound achieved when directly recognizing the future frame as opposed to predicting from the past one (row Action recognition). The grey row Transitional (AR with GT) shows the accuracy achieved when the transitional model is provided the ground-truth label of the last observed action. The improvement when using the ground-truth label is significant. This suggests that a large part of the missing performance is due to weak action recognition models and that better action recognition will produce stronger results for prediction.

We also perform ablation studies on the ActivityNet dataset in Table 3. Since we are not provided sequences of action annotations in this dataset, for this experiment we can only apply the transitional model based on Visual Attributes. Here again, we demonstrate the complementarity of the predictive and transitional models. The average of both approaches provides the best results for action anticipation. We also show the importance of modeling gt as a low-rank linear model on visual attributes. Constraining gt to be a low-rank linear model provides a boost of more than 4% in accuracy.

4.4. Comparison to the state-of-the-art

We compare our approach to the state of the art on both the EPIC-KITCHENS and the Breakfast datasets. Table 5 shows our method compared to the recent work of Farha et al. [7]. The numbers for Vondrick et al. [43] are based on the reimplementation of this method provided in [7]. Table 4 reports results obtained from the EPIC-KITCHENS unseen kitchens action anticipation challenge submission server. Note that our EPIC-KITCHENS submission was made under the anonymous nickname masterchef and is reported in the row Ours (Predictive [D] + Transitional) in this paper. On both datasets, our method outperforms all previously reported results under almost all metrics. Note that our best submitted model on the EPIC-KITCHENS challenge is simple and does not make use of any ensembling or optical flow input.

4.5. Qualitative analysis

As explained in Subsection 3.2, through the analysis of the transitional model ftrans we can identify which visual attributes are responsible for the anticipation of each action class. To do so, we analyze the linear weights from gt (Eq. 3) to list the top visual attributes maximizing the prediction of each action class. Table 6 shows some action classes from the ActivityNet 200 [4] dataset and the top-3 visual attributes that maximize their anticipation. For instance, we can observe that identifying a Border collie (a dog specialized in the activity of disc dog) in a video is useful for the prediction of the Disc dog action class. Recognizing Lemon and Measure cup is indicative of Making lemonade.

5. Conclusion

We have described a new model for future action anticipation. The main motivating idea for our method is to model action anticipation as a fusion of two complementary modules. The predictive approach is a purely anticipatory model. It aims at directly predicting the future action given the present. On the other hand, the transitional model is first constrained to recognize what is currently seen and then uses this output to anticipate future actions. Our approach achieves state-of-the-art action anticipation performance on the EPIC-KITCHENS [6] and Breakfast [19] datasets.

                                    Action                   Verb                     Noun
Model                               A@1   A@5   P    R       A@1   A@5   P     R      A@1   A@5   P    R
Damen et al. (TSN Fusion) [6]       1.7   9.1   1.0  0.9     25.4  68.3  13.0  5.7    9.8   27.2  5.1  5.6
Damen et al. (TSN Flow) [6]         1.8   8.2   1.1  0.9     25.6  67.6  10.8  6.3    8.4   24.6  5.0  4.7
Damen et al. (TSN RGB) [6]          2.4   9.6   0.9  1.2     25.3  68.3  7.6   6.1    10.4  29.5  8.8  6.7
DMI-UNICT                           7.3   18.8  2.5  4.0     27.2  69.3  13.6  9.2    12.4  30.7  8.7  8.9
Ours (Predictive)                   6.1   18.0  1.6  2.9     27.5  71.1  12.3  8.4    10.8  30.6  8.6  8.7
Ours (Predictive + Transitional)    7.2   19.3  2.2  3.4     28.4  70.0  11.6  7.8    12.4  32.2  8.4  9.9

Table 4: EPIC-KITCHENS results on the hold-out unseen test set S2. The official ranking is based on the action top-1 accuracy score (A@1). A@1: top-1 accuracy, A@5: top-5 accuracy, P: precision, R: recall. Challenge website details: https://competitions.codalab.org/competitions/20071. Note that our best model was submitted under the anonymous nickname masterchef.

Model                                    Accuracy
Random baseline                          2.1
Vondrick et al. [43]                     8.1
Abu Farha et al. (CNN) [7]               27.0
Abu Farha et al. (RNN) [7]               30.1
Ours (Transitional (AR))                 23.9
Ours (Predictive)                        31.9
Ours (Predictive + Transitional (AR))    32.3
Ours (Transitional (AR with GT))         43.0

Table 5: Comparison to the state of the art on Breakfast. We report anticipation accuracy averaged over all test splits of the Breakfast dataset [19] and use T = 1 sec.

Action to anticipate     Top-3 visual attributes
Applying sunscreen       Sunscreen, Lotion, Swimming trunk
Bull fighting            Ox, Bulldozing, Bullring
Camel ride               Arabian camel, Crane, Riding scooter
Disc dog                 Border collie, Collie, Borzoi
Drinking coffee          Hamper, Coffee mug, Espresso
Making an omelette       Cooking egg, Wok, Shaking head
Making lemonade          Lemon, Measure cup, Pitch
Playing ice hockey       Hockey arena, Hockey stop, Teapot
Preparing pasta          Guacamole, Carbonara, Frying pan
Preparing salad          Wok, Head cabbage, Winking
Raking leaves            Hay, Sweeping floor, Rapeseed
Using parallel bars      Parallel bars, High jump, Coral fungus

Table 6: Top-3 attributes indicative of actions. Top-3 visual attribute activations for the anticipation of some action classes from the ActivityNet 200 dataset.

Acknowledgment. The project was partially supported by the Louis Vuitton - ENS Chair on Artificial Intelligence, the ERC grant LEAP (No. 336845), the CIFAR Learning in Machines & Brains program, and the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15 003/0000468).

References

[1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, 2016.
[2] M. S. Aliakbarian, F. S. Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson. Encouraging LSTMs to anticipate actions very early. In ICCV, 2017.
[3] G. Bertasius and J. Shi. Using cross-model egosupervision to learn cooperative basketball intention. arXiv preprint arXiv:1709.01630, 2017.
[4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[5] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[6] D. Damen, H. Doughty, G. M. Farinella, S. Fidler, A. Furnari, E. Kazakos, D. Moltisanti, J. Munro, T. Perrett, W. Price, et al. Scaling egocentric vision: The EPIC-KITCHENS dataset. In ECCV, 2018.
[7] Y. A. Farha, A. Richard, and J. Gall. When will you do what? Anticipating temporal occurrences of activities. In CVPR, 2018.
[8] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[9] P. Felsen, P. Agrawal, and J. Malik. What will happen next? Forecasting player moves in sports videos. In ICCV, 2017.
[10] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
[11] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell. ActionVLAD: Learning spatio-temporal aggregation for action classification. In CVPR, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[13] L. A. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. Localizing moments in video with natural language. In ICCV, 2017.
[14] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[15] M. Hoai and F. De la Torre. Max-margin early event detectors. 2014.
[16] A. Jain, H. S. Koppula, B. Raghavan, S. Soh, and A. Saxena. Car that knows before you do: Anticipating maneuvers via learning temporal driving models. In ICCV, 2015.
[17] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
[18] Y. Kong, Z. Tao, and Y. Fu. Deep sequential context networks for action prediction. In CVPR, 2017.
[19] H. Kuehne, A. Arslan, and T. Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. In CVPR, 2014.
[20] T. Lan, T.-C. Chen, and S. Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
[21] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[22] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li. Learning from noisy labels with distillation. In ICCV, 2017.
[23] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR, 2011.
[24] P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun. Predicting deeper into the future of semantic segmentation. In ICCV, 2017.
[25] S. Ma, L. Sigal, and S. Sclaroff. Learning activity progression in LSTMs for activity detection and early detection. In CVPR, 2016.
[26] W.-C. Ma, D.-A. Huang, N. Lee, and K. M. Kitani. Forecasting interactive dynamics of pedestrians with fictitious play. In CVPR, 2017.
[27] T. Mahmud, M. Hasan, and A. K. Roy-Chowdhury. Joint prediction of activity labels and starting times in untrimmed videos. In ICCV, 2017.
[28] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
[29] A. Miech, I. Laptev, and J. Sivic. Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905, 2017.
[30] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In NIPS, 2015.
[31] S. L. Pintea, J. C. van Gemert, and A. W. Smeulders. Deja vu. In ECCV, 2014.
[32] B. A. Plummer, M. Brown, and S. Lazebnik. Enhancing video summarization via vision-language embedding. In CVPR, 2017.
[33] N. Rhinehart and K. M. Kitani. First-person activity forecasting with online inverse reinforcement learning. In ICCV, 2017.
[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
[35] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.
[36] Y. Shi, B. Fernando, and R. Hartley. Action anticipation with RBF kernelized feature mapping RNN. In ECCV, 2018.
[37] Z. Shou, H. Gao, L. Zhang, K. Miyazawa, and S.-F. Chang. AutoLoc: Weakly-supervised temporal action localization in untrimmed videos. In ECCV, 2018.
[38] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
[39] K. K. Singh and Y. J. Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV, 2017.
[40] T. Suzuki, H. Kataoka, Y. Aoki, and Y. Satoh. Anticipating traffic accidents with adaptive loss and large-scale incident DB. In CVPR, 2018.
[41] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
[42] G. Varol, I. Laptev, and C. Schmid. Long-term temporal convolutions for action recognition. PAMI, 2017.
[43] C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating visual representations from unlabeled video. In CVPR, 2016.
[44] C. Vondrick and A. Torralba. Generating the future with adversarial transformers. In CVPR, 2017.
[45] J. Walker, K. Marino, A. Gupta, and M. Hebert. The pose knows: Video forecasting by generating pose futures. In ICCV, 2017.
[46] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[47] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. UntrimmedNets for weakly supervised action recognition and detection. In CVPR, 2017.
[48] X. Wei, P. Lucey, S. Vidas, S. Morgan, and S. Sridharan. Forecasting events using an augmented hidden conditional random field. In ACCV, 2014.
[49] T.-Y. Wu, T.-A. Chien, C.-S. Chan, C.-W. Hu, and M. Sun. Anticipating daily intention using on-wrist motion triggered sensing. In ICCV, 2017.
[50] Y. Xu, Z. Piao, and S. Gao. Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In CVPR, 2018.
[51] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, 2016.
[52] K.-H. Zeng, S.-H. Chou, F.-H. Chan, J. C. Niebles, and M. Sun. Agent-centric risk assessment: Accident anticipation and risky region localization. In CVPR, 2017.
[53] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. Temporal action detection with structured segment networks. 2017.
[54] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. TPAMI, 2017.

