Toyota Smarthome Untrimmed: Real-World Untrimmed Videos for Activity Detection

Rui Dai, Srijan Das, Saurav Sharma, Luca Minciullo, Lorenzo Garattoni, Francois Bremond, Gianpiero Francesca

Abstract—Designing activity detection systems that can be successfully deployed in daily-living environments requires datasets that characterize the challenges typical of real-world settings. In this work, we introduce a new untrimmed daily-living dataset that features several real-world challenges: Toyota Smarthome Untrimmed (TSU). TSU contains a wide variety of activities performed in a spontaneous manner. Activities are collected in real-world settings, which results in non-optimal viewpoints. The dataset contains dense annotations, including elementary activities, composite activities, and activities involving interaction with objects. We provide an analysis of the real-world challenges featured by the TSU dataset, highlighting the open issues for detection algorithms. We show that current state-of-the-art methods fail to achieve satisfactory performance on the TSU dataset. We release the dataset for research use at https://project.inria.fr/toyotasmarthome

Index Terms—untrimmed videos, activity detection, activities of daily living, real-world settings.

1 TOYOTA SMARTHOME DATASET

This work aims at building a large-scale dataset of daily-living activities performed in a natural manner. Activities performed in a spontaneous manner lead to many real-world challenges that are often ignored by the vision community. These include low inter-class variance due to the presence of similar activities, high intra-class variance, low camera framing, low resolution, a long-tail distribution of activities, and occlusions. To this end, we propose the Toyota Smarthome Untrimmed dataset, which provides spontaneous activities with rich and dense annotations to address the detection of complex activities in real-world scenarios.

1.1 Data collection

1.1.1 Collection Setup

We use 7 Microsoft Kinect sensors in the recording phase. The apartment plan and camera locations are shown in Fig. 5. Cameras 1 and 2 cover the dining room area, cameras 4 and 5 the living room, and cameras 3, 6 and 7 the kitchen. Thus, the entire apartment is covered from at least 2 distinct viewing angles. The videos are recorded at 20 frames per second, and the RGB frames are VGA size (640×480), the standard resolution in most real-world scenarios. The dataset offers 3 modalities: RGB, depth and 3D skeleton (i.e. pose) (see Fig. 2).
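The collection setup can be summarized as a small configuration. The following is a minimal sketch, assuming nothing beyond the values stated above; the names RecordingSetup, CAMERA_TO_ROOM and cameras_per_room are hypothetical and do not belong to any released TSU toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingSetup:
    """Recording parameters quoted in the text (field names are illustrative)."""
    fps: int = 20
    rgb_resolution: tuple = (640, 480)
    modalities: tuple = ("rgb", "depth", "skeleton_3d")


# Camera id -> room, as described in the collection setup.
CAMERA_TO_ROOM = {
    1: "dining_room", 2: "dining_room",
    4: "living_room", 5: "living_room",
    3: "kitchen", 6: "kitchen", 7: "kitchen",
}


def cameras_per_room(camera_to_room):
    """Group camera ids by the room they cover, in ascending camera order."""
    rooms = {}
    for cam, room in sorted(camera_to_room.items()):
        rooms.setdefault(room, []).append(cam)
    return rooms


setup = RecordingSetup()
rooms = cameras_per_room(CAMERA_TO_ROOM)
# Every room is observed from at least two distinct viewpoints.
assert all(len(cams) >= 2 for cams in rooms.values())
```

Keeping the camera-to-room mapping in one place makes it straightforward to derive per-room or per-view subsets later, for instance for a cross-view protocol.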

For the 3D skeletons, we fine-tune LCR-Net++ [3] on this dataset and then extract the 2D skeletons. Finally, these 2D skeletons are processed through VideoPose3D [4] to extract the 3D skeletons. We observe that this mechanism extracts 3D poses of better quality than those obtained using depth or LCR-Net++.

• R. Dai, S. Das, S. Sharma and F. Bremond are with Inria and Université Côte d'Azur, 2004 Route des Lucioles, 06902 Valbonne, France. E-mail: {rui.dai, srijan.das, saurav.sharma, francois.bremond}@inria.fr

• L. Minciullo, L. Garattoni and G. Francesca are with Toyota Motor Europe, Hoge Wei 33, B-1930 Zaventem, Belgium.

1.1.2 Collection protocol

One of the key applications of daily-living activity detection is older patient monitoring. Thus, we invited 18 volunteers to our dataset recording sessions. The age of the volunteers ranges between 60 and 80 years. Each volunteer was recorded for 8 hours in one day, from 9 a.m. until 5 p.m. On the day of recording, the volunteer arrived at the apartment at 8 a.m. and was given a visit to get acquainted with the place and the household equipment, such as the coffee machine, television, remote control, etc. The volunteers also received an informal description of what was expected with reference to having meals and interacting with anything in the apartment. The idea was to create a picture of a normal day at home. No further guidance was provided about how the activities should be performed.

In total, we recorded more than 1000 hours of video data. Based on these data, we prepared two datasets: the Toyota Smarthome dataset [5], previously published, and the Toyota Smarthome Untrimmed dataset, which is introduced in this paper.

1.2 Toyota Smarthome Trimmed dataset

Toyota Smarthome Trimmed [5] has been designed for the activity classification task. It consists of 16K short RGB+D clips of 31 activity classes. Each clip is about 12.5 sec. long and contains only one activity. Unlike previous datasets [6], [7], activities were performed in a natural manner. As a result, the dataset poses a unique combination of challenges: high intra-class variation, high class imbalance, activities with similar motion, and high duration variance. Activities were annotated with both coarse and fine-grained labels. These characteristics differentiate Toyota Smarthome Trimmed from other datasets for activity classification.

[Figure 1: image panels. Left, spontaneous behaviors (i)-(iii): Use drawer, Use telephone, Cut bread, shown for Smarthome alongside Charades and PKU-MMD. Right, challenges (1)-(6): Camera framing, Object-based activities, Multi-view, Composite activity (Have Breakfast), Concurrent activities, High temporal variation.]

Fig. 1: Overview of the challenges in TSU. On the left, we present challenges related to spontaneous behaviors. For the first two examples, we present the activity following a strict script on the left, and the same activity performed spontaneously in TSU on the right: (i) In [1], using a drawer is performed quickly, once per video; in TSU, it may be repeated several times in a video, and the subject may keep several drawers open at the same time to facilitate finding things. (ii) In [2], the subject briefly uses the telephone while looking at the camera. In contrast, in TSU, the subject is deeply involved with the telephone and the activity may last several minutes instead of a few seconds. (iii) In TSU, the subject may stay seated or stand up to cut the bread more easily. Besides the spontaneous behaviors, we also illustrate on the right the following real-world challenges: (1) Camera framing: the subject is not in the middle of the image and can even be outside the field of view. (2) Object-based activities: similar activities can be performed while interacting with different objects. (3) Multi-views: activities look different from different viewpoints. (4) Composite activity: composite activities can be split into several elementary activities (e.g. while having breakfast, we may cut bread, spread butter and eat at the table). Moreover, these complex composite activities can last a long period of time. Large variations of appearance make the recognition challenging, requiring an understanding of the composition of elementary activities to better recognize the composite activities. (5) Concurrent activities: activities can be performed concurrently (e.g. taking notes while having a phone call). (6) High temporal variation: in the same untrimmed video, we may have short activities (e.g. putting on glasses) and long activities (e.g. using a tablet). Different instances of the same activity class can also be short or long (e.g. writing), corresponding to high intra-class temporal variance.

1.3 Toyota Smarthome Untrimmed dataset

Toyota Smarthome Untrimmed and Toyota Smarthome Trimmed [5] are obtained from the same recording footage. Unlike Toyota Smarthome Trimmed, TSU targets the activity detection task in long untrimmed videos. Therefore, in TSU, we kept the entire recording whenever the person is visible. The dataset contains 536 videos with an average duration of 21 mins. Since this dataset is based on the same recordings as the trimmed version, it features the same challenges and introduces additional ones. In Section 1.3.1, we describe the annotation protocol. We then present the properties of the TSU dataset in Section 1.3.2 and its challenges in Section 1.3.3, and finally we compare the untrimmed version of the dataset (i.e. TSU) with its trimmed version in Section 1.3.4.
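As a rough sanity check on the scale of the dataset, the figures quoted above can be combined with the 20 fps recording rate from Section 1.1.1. The sketch below is a back-of-the-envelope estimate only: the 21-minute average is rounded, so the result is expected to be close to, but not equal to, the 13.8 M frames reported in Table 1.

```python
NUM_VIDEOS = 536   # number of untrimmed videos, from the text
AVG_MINUTES = 21   # rounded average video duration, from the text
FPS = 20           # recording frame rate, from Section 1.1.1

total_minutes = NUM_VIDEOS * AVG_MINUTES
total_hours = total_minutes / 60
approx_frames = total_minutes * 60 * FPS

print(f"{total_hours:.1f} hours, about {approx_frames / 1e6:.1f} M frames")
# prints "187.6 hours, about 13.5 M frames"
```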

1.3.1 Annotation protocol

TSU is designed particularly for the activity detection task. With the support of medical staff, we identified 51 activities of interest to annotate. A team of annotators manually annotated the videos using the open-source toolkit ELAN [8]. The videos were annotated individually, without relying on the fact that some camera views overlap. The annotation process took more than 6 months, including verification and quality checks. The quality check was performed by 5 annotators. We estimated the correctness of the annotation by considering the accuracy on the same 50 long videos annotated by different annotators. Additionally, we reviewed, normalized and corrected 25 hours of annotation by checking again the videos on which the methods achieved low activity detection performance. Fig. 3 shows an example of the annotation. This example corresponds to the composite activity cooking. While cooking, the subject abruptly stops cutting vegetables and starts heating water in a pot so that she can have boiled water after cutting the vegetables. After setting up the stove, she resumes cutting the vegetables. This process does not follow a strict temporal order and reflects the spontaneous behavior of the participant.

[Figure 2 panels: (1) RGB, (2) 3D Skeleton, (3) Depth.]

Fig. 2: Available modalities in Toyota Smarthome Untrimmed. Note: in the sub-figure of the RGB modality, we also mark the 2D skeleton joints.

[Figure 3: annotation timeline from 1:21 to 1:45 with overlapping instances of Cook, Cut vegetable, Use stove, Use drawer, Get water, Walk, Put sth. on table, Take sth. off table and Dump in trash.]

Fig. 3: An example of annotation on the TSU dataset. '←' and '→' indicate respectively the start and end of an activity.
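Interval annotations of the kind illustrated in Fig. 3 are commonly converted into per-frame label sets before training or evaluating a detector. The following is a minimal sketch of that conversion, assuming a hypothetical (label, start_sec, end_sec) tuple format; it is not the ELAN export format, and the timestamps are invented for illustration rather than taken from TSU.

```python
FPS = 20  # recording frame rate of the dataset


def to_frame_labels(annotations, num_frames, fps=FPS):
    """Convert (label, start_sec, end_sec) intervals into per-frame label sets.

    Overlapping intervals simply add several labels to the same frame,
    which is how composite and concurrent activities are represented here.
    """
    frames = [set() for _ in range(num_frames)]
    for label, start, end in annotations:
        first = max(0, int(round(start * fps)))
        last = min(num_frames, int(round(end * fps)))
        for f in range(first, last):
            frames[f].add(label)
    return frames


# Hypothetical excerpt loosely modeled on the cooking example:
# cutting is interrupted by using the stove, then resumed.
annotations = [
    ("Cook", 0.0, 10.0),
    ("Cut vegetable", 0.0, 3.0),
    ("Use stove", 4.0, 6.0),
    ("Cut vegetable", 7.0, 10.0),
]
frames = to_frame_labels(annotations, num_frames=200)
```

With this representation, frame 0 carries both the composite label and the first elementary label, while frames between 3 s and 4 s carry only the composite label.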

1.3.2 Dataset Properties

The result of the extensive annotation process is a rich corpus of activities. Fig. 4 presents the diversity of activities in this dataset. The activities are categorized into composite and elementary activities. Composite activities are complex activities composed of several elementary activities that may or may not follow a temporal ordering. TSU contains 5 composite activities, which are relatively long. Elementary activities are atomic activities that may be performed concurrently. These activities may or may not be part of a composite activity. TSU contains 46 elementary activities, which may be long or short. In Fig. 4 (c), we illustrate the composite activity cooking with its elementary activities. In Fig. 4 (a) and (b), each composite activity and its corresponding elementary activities are marked with the same color.

TSU contains a rich diversity of elementary activities. We present three challenging scenarios that might occur while attempting to recognize these activities. Firstly, the dataset contains pose-based activities for which poses could be sufficient for classification, whereas appearance information may not improve their recognition. In Fig. 4 (d), we provide 8 such pose-based activities. For example, sit down only needs the 3D poses to be distinguished, whereas the books and laptop around the subject may mislead an appearance-based classifier into recognizing an activity related to those objects, such as reading. Secondly, TSU contains many elementary activities characterized by similar motions and interactions with objects. These objects provide strong cues to distinguish an activity. However, reliably detecting the object throughout the whole video is a challenge. Sometimes, the objects are occluded by the hands of the subject, as when grasping a cup while drinking. As a result, these activities with similar motion are often misclassified as each other. In Fig. 4 (e), we provide 22 such activities. For example, subjects performing use fridge and use cupboard have very similar poses. A fine understanding of the object information (e.g. fridge and cupboard) may facilitate the recognition of these activities. Finally, the dataset contains fine-grained activities characterized by subtle motions, which present additional challenges for the recognition task. In Fig. 4 (f), we describe 7 such activities. For example, subjects who perform the activity Stir coffee/tea move their wrist and forearm only slightly. Compared to activities with pronounced motions, such as sitting down, learning discriminative representations for activities with subtle motions is very challenging.

We further analyze the distribution of the activities in TSU in Fig. 5. We first provide a pictorial representation of the apartment along with the camera placements. TSU features multi-view settings, as all the activities are captured by more than one camera. Then, we provide 6 statistics pertaining to the activity distribution in the dataset. Fig. 5 (a) depicts the distribution of activity instances across the different rooms. Most activities occur in the living room, followed by the kitchen and the dining room. This is similar to the real-life distribution, as we spend most of our time in the living room. Correspondingly, Fig. 5 (f) presents the distribution of environments for each activity. We find that 51% of the activities are environment independent. For instance, we can eat a snack or use a laptop in all three environments. However, activities that rely on specific equipment always occur in the same environment, such as using the oven in the kitchen. Fig. 5 (b) shows the distribution of activities by duration. We find that in TSU, most activities are short, followed by medium and long activities. This is because long activities have few occurrences but longer duration. Interestingly, short activities are often more challenging to detect than longer ones [9]. Fig. 5 (c) shows the distribution of activities based on their intra-class temporal variance. We notice that 22% of the activities have high temporal variance (i.e. they vary by more than 500 sec.). Correspondingly, Fig. 5 (e) provides the heat map of the temporal variance of these activities; a lighter grey indicates a higher temporal variance. Such variance within the same activity class further complicates the task of detection. Finally, Fig. 5 (d) provides the frequency of occurrence of every activity in the dataset. The activities follow a non-uniform distribution consistent with Zipf's law [10].
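The duration and temporal-variation categories used in Fig. 5 (b) and (c) can be sketched as simple binning functions. The thresholds below are the ones shown in the figure legend (10 s and 30 s for duration, 50 s and 500 s for variation). Two points are assumptions rather than statements from the paper: each class is binned by its mean instance duration, and the "variation" of a class is taken as the difference between its longest and shortest instance. The example durations are invented.

```python
from statistics import mean


def duration_bin(seconds):
    """Short (<=10 s), medium (10 s < t <= 30 s) or long (>30 s)."""
    if seconds <= 10:
        return "short"
    if seconds <= 30:
        return "medium"
    return "long"


def variation_bin(spread_seconds):
    """Low (<=50), medium (50 < v <= 500) or high (>500) temporal variation."""
    if spread_seconds <= 50:
        return "low"
    if spread_seconds <= 500:
        return "medium"
    return "high"


def summarize(durations_by_class):
    """Return {class: (duration_bin, variation_bin)}.

    Assumption: mean duration represents the class, and max - min is the
    variation measure; the paper does not spell out either choice.
    """
    summary = {}
    for name, durations in durations_by_class.items():
        spread = max(durations) - min(durations)
        summary[name] = (duration_bin(mean(durations)), variation_bin(spread))
    return summary


# Hypothetical instance durations in seconds.
example = {"Write": [3, 45, 600], "Sit down": [2, 3, 2.5]}
summary = summarize(example)
```

In this toy input, Write falls into the long and high-variation bins, while Sit down is short with low variation.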

1.3.3 Challenges

[Figure 4 panels: (a) Composite; (b) Elementary; (c) Highly related composite & elementary (Cook, with elementary activities including Cut vegetables, Stir the pot, Use stove, Use oven); (d) Pose-based activities (Get up, Lay down, Walk, Sit down, Enter, Leave, Put sth. on table, Take sth. off table); (e) Similar motion/activity (Pour coffee grain, Add water to machine, Use fridge, Use cupboard, Stir the pot, Stir the coffee/tea, Read, Write, Cut bread, Cut vegetables/meat, Put sth. on table, Put sth. in sink, Take pills, Eat snacks, Take sth. off table, Drink from cup, Drink from bottle, Pour from kettle, Pour from bottle, Drink from glasses, Drink from can, Pour from can); (f) Subtle motion (Use laptop, Use tablet, Write, Stir coffee/tea, Boil water, Use stove, Spread butter/jam).]

Fig. 4: On the top row, we divide the 51 activities in TSU into (a) composite and (b) elementary activities. Then, we analyze the activities along four properties: (c) highly related composite and elementary activities, (d) pose-based activities, (e) similar motion/activities, and (f) activities with subtle motion.

TSU provides the following 7 real-world challenges. (1) Spontaneous behavior: TSU is an untrimmed ADL dataset in which people are recorded while performing activities in a spontaneous manner. This property defines the uniqueness of the TSU dataset. (2) Low camera framing: because of the long duration of the recording, the subjects do not pay attention to the fixed cameras. Therefore, activities can be performed very far from, very close to, or out of view of the camera. Activities can also be partially occluded by furniture. (3) Object-based activities: the annotations in TSU include the fine-grained details of activities performed using different objects (e.g., drinking from a cup, can or bottle). TSU contains 7 object-based activities. (4) Multi-views: TSU features 7 camera views. To evaluate the robustness of detection methods to different camera views, we provide a cross-view evaluation protocol. (5) Composite activities: TSU contains 5 composite activity classes and 16 related elementary activity classes. (6) Concurrent activities & dense annotation: TSU contains up to 4 concurrent activities in a single frame. About 10% of the frames contain more than one activity label. On average, there are about 76 activity instances per video. (7) High temporal variance: this new dataset offers a large inter-class and intra-class temporal variance. TSU features short activities (e.g. putting on glasses), long activities (e.g. reading a book), and instances of the same class that can be long or short (e.g. writing ranges from 3 seconds to 10 minutes). As a result, handling temporal information is critical to achieving good detection performance on TSU.
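Statistics such as the maximum number of concurrent activities and the share of multi-label frames in challenge (6) can be computed directly from per-frame label sets. The sketch below assumes such label sets are already available; the function name and the toy input are hypothetical and do not reproduce the reported TSU figures.

```python
def concurrency_stats(frame_labels):
    """Summarize label concurrency over a sequence of per-frame label sets."""
    counts = [len(labels) for labels in frame_labels]
    if not counts:
        return {"max_concurrent": 0, "multi_label_fraction": 0.0}
    multi = sum(1 for c in counts if c > 1)
    return {
        "max_concurrent": max(counts),
        "multi_label_fraction": multi / len(counts),
    }


# Toy sequence: 10 frames, one of which carries two labels at once.
toy_frames = [{"Walk"} for _ in range(9)] + [{"Walk", "Use telephone"}]
stats = concurrency_stats(toy_frames)
```

On this toy sequence, one frame in ten is multi-label, so the fraction is 0.1 and the maximum concurrency is 2.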

1.3.4 Toyota Smarthome Trimmed vs. Untrimmed dataset

As shown in Table 1, compared with the previous version of the dataset, TSU is 1.6 times larger in activity classes, 2.8 times larger in activity instances, and 3.5 times larger in total number of frames. The key features introduced in the untrimmed version of the dataset compared to its trimmed version are: (1) concurrent activities (e.g. taking notes while using the telephone), (2) activities with high intra-class temporal variance, (3) long composite activities (e.g. while cleaning dishes: put dishes in sink → clean with water → dry up), and (4) a greater variety of spontaneous behavior in untrimmed videos (e.g. Fig. 1, searching in different drawers before finding the desired item).

TABLE 1: Comparison of the two versions of Toyota Smarthome.

Dataset      Smarthome      Smarthome
Version      Trimmed [5]    Untrimmed
Task         Recognition    Localization
#Classes     31             51
#Instances   16 K           41 K
#Frames      3.9 M          13.8 M

1.3.5 Benchmark Evaluation

In TSU, we define 2 evaluation protocols: Cross-Subject and Cross-View. We also provide two evaluation metrics (frame-based and event-based mAP). For frame-based evaluation, we adapt the protocol of [11] to evaluate the mAP metric on single frames. This way of evaluating detection is robust to annotation ambiguity. For event-based evaluation, we adapt the protocol of [2]. This metric gives better insight into activity detection, as it is not biased by activity duration.

Cross-Subject (CS): For cross-subject evaluation, we split the 18 subjects into training and test sets. To balance the number of videos for each activity category, we use 11 subjects for training and the remaining 7 for testing. This protocol considers all 51 activities.
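To make the frame-based metric concrete, here is a minimal sketch of per-frame mAP using the standard ranking-based definition of average precision. It is an illustration under stated assumptions, not the exact protocol of [11]: details such as frame sampling, tie handling and the treatment of classes without positive frames may differ (here such classes are skipped). The function names and the toy scores are hypothetical.

```python
def average_precision(scores, targets):
    """AP for one class: mean precision at the rank of each positive frame."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if targets[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else None


def frame_map(score_matrix, target_matrix):
    """Mean AP over classes; rows are frames, columns are classes."""
    num_classes = len(score_matrix[0])
    aps = []
    for c in range(num_classes):
        ap = average_precision(
            [row[c] for row in score_matrix],
            [row[c] for row in target_matrix],
        )
        if ap is not None:  # skip classes with no positive frame
            aps.append(ap)
    return sum(aps) / len(aps)


# Toy example: 4 frames, 2 classes.
scores = [[0.9, 0.2], [0.8, 0.7], [0.1, 0.6], [0.3, 0.1]]
targets = [[1, 0], [0, 1], [1, 0], [0, 0]]
result = frame_map(scores, targets)  # class APs are 0.75 and 1.0
```

Because every frame is scored independently for every class, a small shift in an annotated boundary changes only a few frame labels, which is one way to read the robustness to annotation ambiguity mentioned above.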

[Figure 5 panels: apartment plan with cameras C1-C7: Dining room (C1, C2), Kitchen (C3, C6, C7), Living room (C4, C5); (a) Environment; (b) Activity duration: Short (≤10s), Medium (10s<t≤30s), Long (>30s); (c) Temporal variance in sec.: Low (≤50), Medium (50<Var.≤500), High (>500); (d) number of instances per activity class, ranging from 6267 down to 20; (e) temporal variance heat map; (f) per-activity distribution over environments.]

Fig. 5: Top row (from left to right): the 7 camera locations (C: camera), and the activity distribution across (a) environments, (b) duration and (c) temporal variance. Remark: (a) is per activity instance; (b) and (c) are per activity class. Bottom row: (d) instance frequency and the corresponding (e) temporal variance heat map (the lighter the color, the larger the variance), and (f) the distribution of performing environment for each activity.

Cross-View (CV): For cross-view evaluation, the training set contains the videos from cameras 1, 3, 4, 6 and 7. The remaining cameras (2 and 5) are reserved for testing. The training set contains all 51 activities, and the testing set contains 32 activities from these two camera views.
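The cross-view protocol amounts to partitioning videos by the camera that recorded them. The sketch below shows this split, assuming each video is described by a hypothetical (video_id, camera_id) pair; the identifiers are invented, and the actual file naming of the released dataset is not assumed here.

```python
TRAIN_CAMERAS = {1, 3, 4, 6, 7}  # training views, from the protocol above
TEST_CAMERAS = {2, 5}            # held-out views, from the protocol above


def cross_view_split(videos):
    """Split (video_id, camera_id) pairs into training and test id lists."""
    videos = list(videos)  # allow generators to be passed in
    train = [vid for vid, cam in videos if cam in TRAIN_CAMERAS]
    test = [vid for vid, cam in videos if cam in TEST_CAMERAS]
    return train, test


# Hypothetical video records.
videos = [("vid_a", 1), ("vid_b", 2), ("vid_c", 5), ("vid_d", 7)]
train_ids, test_ids = cross_view_split(videos)
```

Note that cameras 2 and 5 cover the dining room and the living room respectively, so the test views exclude the kitchen cameras, which is consistent with only a subset of the activities appearing in the test set.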

2 EXPERIMENTS

The goal of these experiments is to verify that the TSU dataset provides novel challenges that are not yet addressed by state-of-the-art algorithms. We evaluate 9 popular methods on the TSU dataset, which represent the state of the art on other densely annotated datasets [1], [12]. We also perform a comparative study between TSU and the challenging Charades dataset on the activity detection task to better highlight how real-world challenges are characterized in both datasets.

2.1 Implementation details

2.1.1 Video encoding

We use three types of encoders to extract the encoding of the input videos. For AGCN [13] and I3D [14] (pre-trained on Kinetics [15]), we fine-tune the models on TSU and then extract the features. In addition, we evaluate this dataset on per-frame features, using Inception V1 [16] pre-trained on ImageNet [17] to extract them. The channel size of I3D and Inception V1 is 1024; the channel size of AGCN is 256.

2.1.2 State-of-the-art methods

Nine activity detection methods are evaluated on our dataset, namely Bottleneck, Non-local network [18], LSTM [19], Bidirectional-LSTM [20], Dilated-TCN [21], R-I3D [22], Super-event [23], TGM [24] and MS-TCN [25]. The Bottleneck method has only one dropout layer (with dropout probability 0.5) followed by a bottleneck layer as the classifier. Non-local [18] has one non-local block applied on the features of the whole video before the classifier. LSTM [26] has one LSTM layer with 512 hidden units and one dropout layer (with dropout probability 0.5). Similarly, Bidirectional-LSTM [20] has two LSTM layers with 512 hidden units each, running in opposite directions; their features are concatenated before the classifier. R-I3D [27] uses I3D [14] as its SD-TCN. We set the anchor scale values to [0.3, 0.6, 1.0, 1.5, 2, 2.5, 2.75, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 10, 12, 14, 16, 18, 20, 24, 28, 32, 38, 42, 50, 58, 66, 78, 84, 90, 96]. For TGM [24], we add one layer to obtain a 4-layer structure. All the methods use the same video encoding and are trained with binary cross-entropy loss with sigmoid activation [28]. The unspecified parameters are similar to those of the original papers.
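Since several activities can be active in the same frame, each class is treated as an independent binary decision, which is why a sigmoid with binary cross-entropy is used rather than a softmax. The following dependency-free sketch shows the loss for a single frame; it illustrates the standard formula only, and the function names are hypothetical rather than taken from the authors' code.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def multilabel_bce(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over classes for one frame.

    logits: raw classifier outputs, one per class.
    targets: 0/1 labels, one per class (several may be 1 at once).
    """
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)


# With all logits at 0, every class probability is 0.5 and the loss is ln 2.
loss = multilabel_bce([0.0, 0.0, 0.0], [1, 0, 1])
```

In practice this per-frame loss would be averaged over all frames of a video and computed by a deep learning framework; the plain-Python version is only meant to make the formula explicit.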

2.2 Comparative study on TSU

Table 2 provides the results of the considered activity detection methods on TSU. Note that the Bottleneck (i.e. Baseline in the tables) used for comparison is a bottleneck layer on top of the segment-level features. Unlike the other baseline methods, Bottleneck has no further temporal processing after the video encoding. Thus, this method cannot effectively model temporal information, which is crucial for activity detection. In contrast, the other activity detection baselines focus on temporal processing. Their improvement over the Bottleneck reflects the effectiveness of modeling temporal information.

2.2.1 State-of-the-art baseline results

In Table 2, we compare the performance of the representative baselines on TSU. The comparative study is conducted with the I3D RGB features. The first method is a proposal-based method that adopts R-C3D [27] with an I3D base network (we call this method R-I3D). This method fails to generate precise proposals for long activities with dense labels due to its high computational cost. Consequently, it yields the worst detection performance on TSU. The second and third methods are the Bottleneck [14] and the Non-local block [18]. We find that the non-local block can provide the information of one-to-one temporal dependency to the local features (+0.9% w.r.t. Bottleneck on TSU-CS); however, the Non-local block is not effective enough. Similarly, Super-event [23] utilizes temporal structure filters to model a latent representation of composite activities and then computes their affinity with each frame (+4.2% w.r.t. Bottleneck on TSU-CS). However, videos in TSU are long and complex, so it is hard to model the latent representation of composite activities in this dataset. We need the temporal filter to gradually embed the information of the local frames into the current frame. LSTM [26] and Bidirectional-LSTM [20] are RNN-based methods. These methods can model short temporal relations (up to +8.8% w.r.t. Bottleneck on TSU-CS), but fail to model the long temporal relationships in the complex activities of TSU. Dilated-TCN [21], TGM [24] and MS-TCN [25] use temporal Gaussian/convolutional filters, which better capture the temporal relationships in long activities (up to +13.5% w.r.t. Bottleneck on TSU-CS). Thanks to these effective temporal filters, these methods can process long-term temporal relations.

TABLE 2: Per-frame mAP on TSU dataset.

Method                         CS     CV
AGCN+Bottleneck [13]           10.1   12.6
AGCN+LSTM [26]                 17.0   14.8
Inception+Bottleneck [16]      11.5   5.2
Inception+LSTM [26]            13.2   5.3
R-I3D [27]                     8.7    -
I3D+Bottleneck [14]            15.7   9.2
I3D+Non-local block [18]       16.8   9.6
I3D+Super event [23]           17.2   10.9
I3D+LSTM [29]                  22.6   12.9
I3D+Bidirectional-LSTM [20]    24.5   15.1
I3D+Dilated-TCN [21]           25.1   13.9
I3D+MS-TCN [25]                25.9   13.1
I3D+TGM [24]                   26.7   13.4

TABLE 3: Event-based mAP (%) for different IoU thresholds on the TSU dataset. Note that the inputs are I3D features from the RGB stream.

                           CS                  CV
IoU Threshold (θ)          0.3    0.5    0.7   0.3    0.5    0.7
Bottleneck [14]            5.0    2.5    0.5   2.3    1.1    0.2
Non-local block [18]       4.9    2.2    0.6   1.6    0.7    0.1
Super event [23]           5.7    2.8    0.7   1.8    0.9    0.1
LSTM [26]                  11.6   6.4    2.2   6.0    3.2    0.7
Bidirectional-LSTM [20]    13.3   7.9    3.5   9.0    5.4    1.2
Dilated-TCN [21]           12.8   6.9    3.0   5.8    3.3    0.8
MS-TCN [25]                13.2   7.6    3.0   5.3    3.1    0.4
TGM [24]                   15.1   9.4    4.2   5.5    3.2    0.4

In Table 3, we present the event-based evaluation of the baselines. The overall low performance indicates that current methods are far from addressing real-world situations.
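Event-based evaluation relies on the temporal intersection over union (IoU) between a predicted segment and a ground-truth segment of the same class. The sketch below shows the IoU computation and a threshold test for the values of θ used in Table 3. It is an illustration only: the use of a greater-or-equal comparison and the omission of duplicate-detection handling are assumptions, and the protocol of [2] should be consulted for the exact matching rules.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold):
    """A detection counts as correct if its IoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold


# Hypothetical prediction and ground truth: 4 s overlap over an 8 s union.
pred, gt = (2.0, 8.0), (4.0, 10.0)
iou = temporal_iou(pred, gt)  # 0.5
matches = {theta: is_true_positive(pred, gt, theta) for theta in (0.3, 0.5, 0.7)}
```

In this example the detection is accepted at θ = 0.3 and θ = 0.5 but rejected at θ = 0.7, which illustrates why the scores in Table 3 drop as the threshold increases.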

3 CONCLUSION

In this paper, we introduce a novel untrimmed dataset that features spontaneous behaviors and several real-world challenges for activity detection: Toyota Smarthome Untrimmed (TSU). Our comparative study showed that the activity detection performance on TSU is still low, highlighting the remaining open issues related to real-world conditions. For this reason, the TSU dataset is licensed for academic research purposes. This will allow researchers to develop novel approaches to promote activity detection in the wild. To learn more about the Toyota Smarthome Untrimmed dataset, please visit the project website¹.

REFERENCES

[1] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, “Hollywood in homes: Crowdsourcing data collection for activity understanding,” in European Conference on Computer Vision (ECCV), 2016.

[2] C. Liu, Y. Hu, Y. Li, S. Song, and J. Liu, “Pku-mmd: A large scale benchmark for continuous multi-modal human action understanding,” arXiv preprint arXiv:1703.07475, 2017.

[3] G. Rogez, P. Weinzaepfel, and C. Schmid, “LCR-Net++: Multi-person 2D and 3D Pose Detection in Natural Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

[4] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli, “3d human pose estimation in video with temporal convolutions and semi-supervised training,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[5] S. Das, R. Dai, M. Koperski, L. Minciullo, L. Garattoni, F. Bremond, and G. Francesca, “Toyota smarthome: Real-world activities of daily living,” in The IEEE International Conference on Computer Vision (ICCV), October 2019.

[6] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[7] J. Wang, X. Nie, Y. Xia, Y. Wu, and S.-C. Zhu, “Cross-view action modeling, learning, and recognition,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, June 2014, pp. 2649–2656.

[8] M. P. Institute, “Tla software: Elan,” https://tla.mpi.nl/tools/tla-tools/elan/, accessed Oct. 30th, 2019.

[9] G. A. Sigurdsson, O. Russakovsky, and A. Gupta, “What actions are needed for understanding human actions in videos?” in International Conference on Computer Vision (ICCV), 2017.

[10] G. V. Horn and P. Perona, “The devil is in the tails: Fine-grained classification in the wild,” CoRR, vol. abs/1709.01450, 2017. [Online]. Available: http://arxiv.org/abs/1709.01450

[11] G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta, “Asynchronous temporal fields for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 585–594.

[12] S. Yeung, O. Russakovsky, N. Jin, M. Andriluka, G. Mori, and L. Fei-Fei, “Every moment counts: Dense detailed labeling of actions in complex videos,” International Journal of Computer Vision, vol. 126, no. 2-4, pp. 375–389, 2018.

[13] L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Two-stream adaptive graph convolutional networks for skeleton-based action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 026–12 035.

[14] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 4724–4733.

[15] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., “The kinetics human action video dataset,” arXiv preprint arXiv:1705.06950, 2017.

[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

1. https://project.inria.fr/toyotasmarthome

[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.

[18] X. Wang, R. B. Girshick, A. Gupta, and K. He, “Non-local neural networks,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.

[19] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: http://dx.doi.org/10.1162/neco.1997.9.8.1735

[20] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural networks, vol. 18, no. 5-6, pp. 602–610, 2005.

[21] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.

[22] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ser. ICCV ’15. Washington, DC, USA: IEEE Computer Society, 2015, pp. 4489–4497. [Online]. Available: http://dx.doi.org/10.1109/ICCV.2015.510

[23] A. Piergiovanni and M. S. Ryoo, “Learning latent super-events to detect multiple activities in videos,” in International Conference on Machine Learning (ICML), 2018.

[24] ——, “Temporal gaussian mixture layer for videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[25] Y. A. Farha and J. Gall, “Ms-tcn: Multi-stage temporal convolutional network for action segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3575–3584.

[26] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[27] H. Xu, A. Das, and K. Saenko, “R-c3d: Region convolutional 3d network for temporal activity detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5783–5792.

[28] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz, “Large-scale multi-label text classification—revisiting neural networks,” in Joint european conference on machine learning and knowledge discovery in databases. Springer, 2014, pp. 437–452.

[29] B. Mahasseni and S. Todorovic, “Regularizing long short term memory with 3d human-skeleton sequences for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3054–3062.
