VC-I2R at the NTCIR-13 Lifelog-2 LSAT Task

Presented by: Qianli Xu
Co-authors: Jie Lin, Ana del Molino, Qianli Xu, Fen Fang, V. Subbaraju, Joo-Hwee Lim, Liyuan Li, V. Chandrasekhar
Organization: Institute for Infocomm Research, A*STAR, Singapore

About VC-I2R
• Institute for Infocomm Research (I2R), A*STAR, Singapore
  – Visual Computing
  – Human Language Tech
  – Data Analytics
  – Neural & Biomedical Tech
  – etc.
• Visual Computing Department
  – Video/image analytics & search
  – Augmented visual intelligence
  – Visual inspection

Website: www.a-star.edu.sg/i2r/
[Framework diagram: a query topic and the lifelog images feed an Object Classifier (CNN), a Places Classifier, an Object Detector (Faster R-CNN), and the NTCIR-13 Classifier (trained offline on training images), alongside user-given time tags, location tags, and people counts (computed online). The resulting relevant concepts are combined with feature weights w1–w7 and passed through temporal smoothing.]
LSAT Framework: Image + Metadata

Query Topics and the Semantic Gap
• Relevant concepts: which CNN predictions are relevant to the query topics?
• Feature weighting: which features contribute the most?
• Temporal smoothing: enforce temporal coherence, remove outliers
• Post-filtering: refine the search using location (GPS) and time

Example topics: “Castle @ Night”, “Working in a coffee shop”, “Gardening in my home”

del Molino, et al., 2017. VC-I2R at ImageCLEF 2017: Ensemble of deep learned features for lifelog video summarization. CLEF Working Notes, CEUR.
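The feature-weighting step can be sketched as a weighted late fusion of per-feature relevance scores. A minimal illustration (the feature names, score values, and weights below are hypothetical, not the learned ones):

```python
import numpy as np

def fuse_scores(feature_scores, weights):
    """Combine per-feature relevance scores for each image with a weighted
    sum. feature_scores: dict feature_name -> (N,) array of scores in [0, 1];
    weights: dict feature_name -> scalar weight."""
    names = sorted(feature_scores)
    stacked = np.stack([feature_scores[n] for n in names])  # (F, N)
    w = np.array([weights[n] for n in names])                # (F,)
    return w @ stacked                                       # (N,) fused scores

# Hypothetical scores for 4 lifelog images from three feature streams.
scores = {
    "imagenet": np.array([0.9, 0.1, 0.4, 0.7]),
    "places":   np.array([0.2, 0.8, 0.5, 0.6]),
    "mscoco":   np.array([0.0, 0.3, 0.9, 0.1]),
}
weights = {"imagenet": 2.0, "places": 1.0, "mscoco": 0.5}
fused = fuse_scores(scores, weights)
ranking = np.argsort(-fused)  # images ranked by fused relevance
```

In the actual system the weights are learned per topic on the development set (see Fig. 3) rather than fixed by hand.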
1. Getting the Basic Semantics
• CNN classifiers
  – Object: ResNet-152 – ImageNet 1K
  – Place: ResNet-152 – Places365
• CNN detector
  – Faster R-CNN – MS COCO (80 classes)
• NTCIR-13 classifier
  – VGG-16 – ImageNet 1K
  – Replace the last layer (1K neurons) with 634 neurons
  – Sigmoid as the activation function
• Human detection and counting
  – Sighthound (https://www.sighthound.com)
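The NTCIR-13 classifier head can be sketched without a specific deep learning framework: the 1K-neuron ImageNet layer is replaced by a 634-neuron layer with sigmoid activations for multi-label concept prediction. A minimal NumPy sketch (the 634 concepts are from the slide; the 4096 feature dimension matches VGG-16's penultimate layer, and the random weights stand in for trained ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

NUM_CONCEPTS = 634  # number of NTCIR-13 concepts (from the slide)

rng = np.random.default_rng(0)
features = rng.standard_normal((2, 4096))  # VGG-16 penultimate-layer activations
# New head replacing the original 1000-way ImageNet classification layer.
new_head = rng.standard_normal((4096, NUM_CONCEPTS)) * 0.01

# Multi-label concept scores: a sigmoid per concept instead of a softmax,
# so each of the 634 concepts is predicted independently.
probs = sigmoid(features @ new_head)
```

Sigmoid rather than softmax matters here: a lifelog image can exhibit many concepts at once, so the outputs must not compete for probability mass.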
2. Aggregating & Weighing Features

[Fig. 3: five panels (tasks 1, 2, 5, 6, and 10; X = 400) plotting P, R, and F1 on a 0–1 axis for different candidate parameter settings.]

Fig. 3: Weight learning from the development set. Description of the parameters: q = quality threshold; w = [w_coco, w_objy, w_objn, w_ply, w_pln, w_loc, w_act, w_ppl]; win = size of the smoothing window. (Best viewed in color.)
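The weight learning in Fig. 3 amounts to a grid search over q, w, and win on the development set, keeping the configuration with the best F1. A hedged sketch (the `evaluate` stand-in below is hypothetical; the real system would run retrieval on the development set and measure precision/recall against its annotations):

```python
import itertools
import numpy as np

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def evaluate(weights, q, win):
    """Placeholder for running retrieval with the given feature weights,
    quality threshold q, and smoothing window win, and returning
    (precision, recall). This toy stand-in peaks at weights == 1."""
    p = 1.0 / (1.0 + np.sum((np.asarray(weights) - 1.0) ** 2))
    return p, p

# Coarse grid over a few candidate settings, in the spirit of Fig. 3.
best = max(
    ((q, w, win) for q in (0, 35, 50)
                 for w in itertools.product((0.0, 1.0, 2.0), repeat=2)
                 for win in (1, 3)),
    key=lambda cfg: f1(*evaluate(cfg[1], cfg[0], cfg[2])),
)
```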
Task | Objects: relevant | Objects: avoid | Places: relevant | Places: avoid | MSCOCO: relevant
1  | computer, group meeting | – | computer, group meeting, etc. | – | laptop, keyboard
2  | television, food, glass | computer, group meeting | living room, television room, etc. | conference room, lecture room, etc. | tv, remote, etc.
3  | computer | group meeting | office, coffee shop, living room, etc. | conference room, office, etc. | laptop, keyboard
4  | computer, pencil, notebook | – | office, living room, hotel room, etc. | conference room, office, etc. | laptop, book, etc.
5  | food, glass | drum, white goods, menu | food court, restaurant, etc. | – | fork, sandwich, etc.
6  | drink, glass, beverage | computer | bar, pub, etc. | home | bottle, wine glass
7  | – | public transport | temple, palace, etc. | residential neigh., bus, etc. | –
8  | public transport | cab, car seat, taxi | bus interior, subway station, etc. | car interior | bus, train
9  | food, cooking utensil, white goods | – | pantry, kitchen, etc. | living room | oven, refrigerator, etc.
10 | shopping, shop | – | supermarket, store, shopfront, shopping mall, etc. | – | –

Table 1: Semantic queries for the retrieval task: concepts to search in WordNet to find all related (and to avoid) ImageNet classes, manual selection of relevant (and to avoid) Places classes, and objects to detect.
CRF for feature weighing that accommodates individual differences

Relevance mapping for each topic
In 40% of the tested cases, the users consider the summary obtained with AVS better than any other summary, including the summaries annotated with manual tools by other users.
2 General Overview of Active Video Summarization

The aim of AVS is to provide a customized summary with as little effort as possible from the user side. The system first asks for the user's initial preferences, selected from a set of items, i.e. the most frequent items in the original video. Then, the user's preferences are further refined through a question-asking inference.

AVS asks the user specific questions about segments of the video. It shows one selected segment, and asks the following two binary questions: Would you want this segment to be in the final summary?, and Would you want to include similar segments? Additionally, the user can decide at any time to go through the segments in the summary, and give such feedback about them. Although AVS is not limited to these two questions, experiments show that they are effective in practice, and they serve us as a proof of concept. Note that the original video is not shown to the user, as the summary and the segments shown during the interaction provide an accurate idea of the content of the video in much less time.

Thus, AVS can be divided into two inference problems: (i) infer the customized summary, and (ii) infer the next question to ask. We use a probabilistic approach based on Conditional Random Fields (CRFs) to infer the most likely summary, and to estimate the next question to ask. CRFs are sound probabilistic models that have been successfully applied to many computer vision and multimedia problems (Lafferty, McCallum, and Pereira 2001). In the following sections, we introduce CRFs to infer the customized summary, and then the algorithm that infers the questions to ask. They are summarized in Alg. 1.
3 Inference of the Customized Summary

Let s = {s_i} be the set of random variables that represent the summary of the video by indicating whether a segment (or subshot) of the video appears in the summary or not. Thus, s_i ∈ {0, 1}, where s_i is equal to 1 when the segment is included in the summary, and 0 otherwise. We denote P(s|θ) as the probability density distribution of how likely the summary s is preferred by the user. We model this distribution with a CRF, and θ are the values for the potentials of the CRF, which depend on the input video and the user's preferences.

A CRF models the probability density with a Gibbs distribution, c.f. (V. and Wainwright 2005). Therefore, P(s|θ) can be written as the normalized exponential of an energy function, which is denoted as E_θ(s). The energy function is the sum of a set of potentials, which are functions that take as input a subset of {s_i}. The summary of the video, which is denoted as s*_θ, is obtained by inferring the Maximum a Posteriori (MAP), i.e. s*_θ = argmax_s P(s|θ), or equivalently, maximizing the energy function E_θ(s).

In the following, we first introduce the potentials of the CRF, and then the algorithm to obtain the MAP summary.
3.1 CRF for Customized Summarization

We follow most methods in the literature, which select representative and diverse segments with as little motion as possible. To do so, we define the energy function of the CRF as

    E_θ(s) = λ Σ_i φ_u(s_i) + Σ_{ij} φ_p(s_i, s_j),    (1)

where the first sum collects the unary potentials φ_u and the second the pairwise potentials φ_p. The unary potentials enforce the selection of static segments, the pairwise potentials encourage segments with diverse semantic content, and λ is a parameter that weights the unary potentials with respect to the pairwise ones. There is a unary potential for each segment of the video, and one pairwise potential for each pair of similar segments. The length of the summary is controlled during the inference of the MAP summary by adding additional constraints to the energy function, as we show below.
Next, we introduce the potentials, with emphasis on how they are updated when new user preferences become known. Note that we omit the dependency of the potentials on θ for simplicity, and the parameters introduced in the following should be considered part of θ. The values of the parameters of the potentials are given in the implementation details in sec. 5.2.

Unary Potentials. The unary potentials, {φ_u(s_i)}, encourage selecting segments that the user will probably like. φ_u(s_i) is equal to Q_i I[s_i = 1] + L I[s_i = 0], in which: I[a] is an indicator function that is 1 if a is true and 0 otherwise; Q_i is a function representing how well that segment individually relates to the requirements; and L is a constant offset that is set during the MAP inference of the summary in order to adjust the summary length (sec. 3.2).

During the on-line interaction phase, when the user recommends including a segment s_i, Q_i is increased by a fixed amount to enforce the selection of that segment; otherwise, Q_i is decreased by the same amount.

Pairwise Potentials. The pairwise potentials, {φ_p(s_i, s_j)}, are defined between each pair of similar segments, and enforce selecting segments with diverse content.
Let d(x_i, x_j) be the Euclidean distance between the descriptors of two segments (details in sec. 5.2). The pairwise potential enforces that similar segments should not both be included in the summary. To do so, we define a potential that is weighted by the distance between descriptors, i.e. φ_p(s_i, s_j) = exp(−d(x_i, x_j)) φ′_p(s_i, s_j), in which φ′_p(s_i, s_j) enforces that both segments should not be selected at the same time, and the term exp(−d(x_i, x_j)) reduces the effect of φ′_p(s_i, s_j) when the segments are dissimilar. In this way, only one representative segment among similar segments is selected.

Specifically, φ′_p(s_i, s_j) is defined as

    φ′_p(s_i, s_j) = { Lα   if s_i = s_j = 0
                     { −Lβ  if s_i = s_j = 1    (2)
                     { γ    if s_i ≠ s_j

where γ is the cost of selecting only one segment in the pair, α and β are the costs to discard or select both segments, respectively, and L is a variable parameter that controls the
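The distance-weighted pairwise potential of Eq. (2) can be sketched directly (a hedged illustration; the descriptor values and parameter settings below are hypothetical, and the energy is maximized, so a more negative value means a worse configuration):

```python
import math

def pairwise_potential(si, sj, di, dj, L, alpha, beta, gamma):
    """Distance-weighted pairwise potential, following the form of Eq. (2):
    phi_p(si, sj) = exp(-d) * phi'_p(si, sj), where d is the Euclidean
    distance between the two segments' descriptors di and dj."""
    d = math.dist(di, dj)
    if si == 0 and sj == 0:
        base = L * alpha    # both segments discarded
    elif si == 1 and sj == 1:
        base = -L * beta    # both selected: penalizes redundancy
    else:
        base = gamma        # exactly one of the pair is selected
    return math.exp(-d) * base

# Selecting two near-identical segments is penalized strongly; for
# dissimilar segments the redundancy penalty fades with descriptor distance.
near = pairwise_potential(1, 1, [0.0, 0.0], [0.1, 0.0], L=1.0, alpha=0.5, beta=2.0, gamma=1.0)
far  = pairwise_potential(1, 1, [0.0, 0.0], [5.0, 0.0], L=1.0, alpha=0.5, beta=2.0, gamma=1.0)
```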
[Framework diagram: feature streams (ImageNet 1K, Places365, MSCOCO, NTCIR, time, #people, location tag), trained on the training images, are mapped to relevant concepts and combined with feature weights w1–w12.]
3. Temporal Smoothing
• Adjacent lifelog images may share a similar event.
• Temporal smoothing is used to ensure semantic coherence.
• A triangular window of size w is used; w is adaptive to event topics.
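The triangular smoothing above might look like the following sketch (the window weights and score values are illustrative; the adaptive choice of w per topic is not modeled here):

```python
import numpy as np

def triangular_smooth(scores, w):
    """Smooth per-image relevance scores with a triangular window of odd
    size w, so adjacent lifelog images share evidence about their event."""
    assert w % 2 == 1 and w >= 1
    half = w // 2
    # Triangular weights, e.g. w=3 -> [1, 2, 1] normalized to sum to 1.
    kernel = np.concatenate([np.arange(1, half + 2), np.arange(half, 0, -1)]).astype(float)
    kernel /= kernel.sum()
    # mode="same" keeps one smoothed score per image; edges see a partial window.
    return np.convolve(scores, kernel, mode="same")

scores = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.9, 0.0])
smoothed = triangular_smooth(scores, w=3)
```

Isolated spikes get spread onto their temporal neighbors, which both suppresses outliers and promotes images adjacent to confidently relevant ones.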
4. Post-filtering
• Increase diversity of retrieved images (avoid retrieving images of the same event)
• Use time and location (GPS) to filter images
• Exclude images that are close in time and location.
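The post-filtering step can be sketched as a greedy diversity filter (a hedged sketch; the thresholds and the planar distance approximation for GPS coordinates are assumptions, not the system's actual values):

```python
import math

def post_filter(candidates, min_dt=600.0, min_dist=50.0):
    """Walk candidates in decreasing score order and drop any image that is
    close in both time (seconds) and location (meters, approximated here by
    Euclidean distance on projected coordinates) to an already-kept image."""
    kept = []
    for img in sorted(candidates, key=lambda c: -c["score"]):
        too_close = any(
            abs(img["t"] - k["t"]) < min_dt
            and math.dist(img["xy"], k["xy"]) < min_dist
            for k in kept
        )
        if not too_close:
            kept.append(img)
    return kept

cands = [
    {"score": 0.9, "t": 0.0,    "xy": (0.0, 0.0)},    # best image of an event
    {"score": 0.8, "t": 30.0,   "xy": (5.0, 0.0)},    # same time and place: dropped
    {"score": 0.7, "t": 5000.0, "xy": (500.0, 0.0)},  # different event: kept
]
kept = post_filter(cands)
```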
Result
• Official score (precision): 57.6%
[Bar chart: per-topic mAP (0 to 1) for User 1 and User 2 over the event topics: Eat Lunch, Gardening, Castle at Night, Coffee, Sunset, Graveyard, Lecturing, Shopping, Working Late, On Computer, Cooking, Flying, Juice, Photo of Sea, Beers in Bar, Greek Amphit., TV Recording, Work w/ Coffee, Painting Walls, Eating Pasta, Exercises, Mountain Hiking, Turtles.]

Figure 3: Event-level retrieval results.
8. REFERENCES
[1] A. Dehghan, E. G. Ortiz, G. Shu, and S. Z. Masood. DAGER: Deep age, gender and emotion recognition using convolutional neural network. arXiv preprint arXiv:1702.04280, 2017.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[3] A. Garcia del Molino, M. Bappaditya, J. Lin, J.-H. Lim, S. Vigneshwaran, and C. Vijay. VC-I2R at ImageCLEF2017: Ensemble of deep learned features for lifelog video summarization. In CLEF Working Notes, CEUR, 2017.
[4] A. Garcia del Molino, Q. Xu, and J.-H. Lim. Describing lifelogs with convolutional neural networks: A comparative study. In Proceedings of the First Workshop on Lifelogging Tools and Applications, pages 39–44. ACM, 2016.
[5] C. Gurrin, H. Joho, F. Hopfgartner, L. Zhou, D.-T. Dang-Nguyen, R. Gupta, and R. Albatal. Overview of NTCIR-13 Lifelog-2 task. In Proceedings of NTCIR-13, Tokyo, Japan, 2017.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[8] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[9] S. Z. Masood, G. Shu, A. Dehghan, and E. G. Ortiz. License plate detection and recognition using deeply learned convolutional neural networks. arXiv preprint arXiv:1703.07330, 2017.
[10] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[11] G. Roig, X. Boix, R. De Nijs, S. Ramos, K. Kuhnlenz, and L. Van Gool. Active MAP inference in CRFs for efficient semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2312–2319, 2013.
[12] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
[13] Q. Xu, S. Vigneshwaran, A. G. del Molino, J. Lin, F. Fang, J.-H. Lim, L. Li, and V. Chandrasekhar. Visualizing personal lifelog data for deeper insights at the NTCIR-13 Lifelog-2 task. In Proceedings of NTCIR-13, Tokyo, Japan, 2017.
[14] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for deep scene understanding. arXiv:1610.02055, 2016.
Analysis (Fine-tuning)

[Bar chart of mAP for User 1 and User 2 under Fixed, Adaptive (User), and Adaptive (User + Event) thresholds; values as in Table 2.]

Table 2: Effect of thresholds for relevant concept searching.
                        User 1   User 2
Fixed                   0.502    0.748
Adaptive (User)         0.528    0.761
Adaptive (User+Event)   0.654    0.826

Table 3: Effect of temporal smoothing.
Temporal Smoothing?     User 1   User 2
No                      0.528    0.761
Yes                     0.543    0.789
4.2 Official Results
The official evaluation measures the number of events detected in a given day (compared to the ground truth) as well as the accuracy of the event-detection process (given a sliding five-minute window). The metrics used are precision and recall, and the official score is the mean of precisions over the topics. Due to the complexity and difficulty of the queries, topics 16, 20, 23, and 24 were discarded, and only the remaining 20 topics were evaluated. The official score reported for our team is 57.6%, which ranked first.
4.3 Analysis
Besides the official results, we also use mean Average Precision (mAP) as our own evaluation metric to study the effect of key components of the proposed framework. The results are reported on the sampled test set with ground truth annotated by our team.
Effect of Thresholds.
We explore the effect of thresholds for relevant concept searching. Two configurations for the thresholds are tested: (1) the thresholds are adaptive to each user, and (2) the thresholds are adaptive to both user and event, which is more advanced than the first configuration. As shown in Table 2, both configurations outperform fixed thresholds. Moreover, the advanced configuration improves on the first one by a large margin.

Temporal Smoothing.
Table 3 studies the effect of temporal smoothing on the system, with the thresholds for relevant concept searching fixed. Note that there are consistent improvements for both users.
Feature Importances.
Figure 2 compares how important different features are for the retrieval task. "All" denotes that all features are used, while "- NTCIR-13" means the NTCIR-13 classifier feature is removed from "All" in the retrieval system, and likewise for the other configurations. A lower score for a configuration means removing that feature causes a larger performance drop, indicating the respective feature is important for retrieval. We observe that "NTCIR-13" is the most important feature to the system, followed by time, MSCOCO, and location among all the CNN-based features.
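This leave-one-out ablation can be sketched generically (the feature names match the paper, but the evaluator and its per-feature mAP contributions below are hypothetical stand-ins for the real retrieval run):

```python
def ablation_importance(evaluate, features):
    """Leave-one-out feature importance: the mAP drop when each feature is
    removed from the full set. `evaluate` maps a feature set to mAP;
    a larger drop means a more important feature."""
    full = evaluate(frozenset(features))
    return {f: full - evaluate(frozenset(features) - {f}) for f in features}

# Hypothetical evaluator: each feature contributes a fixed amount of mAP.
contrib = {"ntcir13": 0.20, "time": 0.10, "mscoco": 0.06, "location": 0.04}

def evaluate(feats):
    return 0.4 + sum(contrib[f] for f in feats)

drops = ablation_importance(evaluate, contrib)
ranked = sorted(drops, key=drops.get, reverse=True)  # most important first
```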
Event-level Results.
Figure 3 shows retrieval mAP for all event topics, using
[Bar chart: mAP (0.4 to 0.9) for User 1 and User 2 under configurations All, − NTCIR-13, − ImageNet1K, − Places365, − MSCOCO, − Location, − Time, − #People.]

Figure 2: Comparison of feature importances to the retrieval system.
our best model. One can see that for User 1, our system performs worse on topics like "Gardening", "Grocery Shopping", and "Painting Walls".
4.4 Application in the LIT Task
The method proposed in this paper is used in [13] for annotating activities in the NTCIR-13 Lifelog-2 Lifelog Insight Task (LIT). Ten activities are defined for LIT, namely: eating, walking, running, hiking, gym/yoga, socializing, taking a bus, driving a car/taking a taxi, taking a train, and being on a flight. Similar to the LSAT topics, the LIT activities have varying levels of abstraction, and the number of incidences ranges from a few to thousands. Our algorithm achieves a similar level of precision and recall in the LIT activity retrieval. The result has been effectively used for insight generation.
5. CONCLUSIONS
This paper focuses on the problem of event-driven lifelog image retrieval. We presented a general deep learning based framework to address a major challenge of the task: bridging the gap between visual images and high-level event concepts. We submitted the generated retrieval results to the NTCIR-13 Lifelog-2 Lifelog Semantic Access Task. Promising results have been officially reported, demonstrating the effectiveness of the proposed retrieval system.
6. ACKNOWLEDGEMENTS
The work is funded by the Singapore A*STAR JCO VIP REVIVE Project (1335h0009).
7. ADDITIONAL AUTHORS
Liyuan Li (Institute for Infocomm Research, A*STAR, Singapore. email: [email protected]) and Vijay Chandrasekhar (Institute for Infocomm Research, A*STAR, Singapore; Nanyang Technological University, Singapore. email: [email protected]).
Feature importance
Decrease in performance when we remove one type of feature: the bigger the decrease, the more important the feature. (Per-user results shown for User 1 and User 2.)

Effect of temporal smoothing
Whether temporal smoothing is performed or not.

Effect of threshold for relevant concept searching
Semantic concepts whose activation level is above the threshold are considered relevant to the query topic.

[Bar chart of mAP without smoothing (User 1: 0.528, User 2: 0.761) vs. with temporal smoothing (User 1: 0.543, User 2: 0.789).]
LIT

Summary
Effective lifelog image retrieval relies on:
• High-quality data
• Good semantic features
• Reasonable ground truth
• Intelligence in interpretation of query topics
• Intelligence in model fine-tuning

• A lot of fine-tuning and manual intervention are involved in the retrieval → over-fitting?
• "Relevant" concepts may not be contributing, and vice versa.
• Interactive retrieval is probably a good intermediate solution.

Email: [email protected]