VC-I2R at the NTCIR-13 Lifelog-2 LSAT Task

Presented by: Qianli Xu
Co-authors: Jie Lin, Ana del Molino, Qianli Xu, Fen Fang, V. Subbaraju, Joo-Hwee Lim, Liyuan Li, V. Chandrasekhar
Organization: Institute for Infocomm Research, A*STAR, Singapore

About VC-I2R
• Institute for Infocomm Research (I2R), A*STAR, Singapore
  – Visual Computing
  – Human Language Tech
  – Data Analytics
  – Neural & Biomedical Tech
  – etc.
• Visual Computing Department
  – Video/image analytics & search
  – Augmented visual intelligence
  – Visual inspection

Website: www.a-star.edu.sg/i2r/
[Framework diagram: a query topic and the lifelog images feed an Object Classifier (CNN), a Places Classifier, an Object Detector (Faster R-CNN), and the NTCIR-13 Classifier (trained offline on training images), alongside user-given time tags, location tags, and people counts (computed online). The resulting relevant concepts are combined with feature weights w1–w7 and passed through temporal smoothing.]
LSAT Framework: Image + Metadata

Query Topics and the Semantic Gap
• Relevant concepts: which CNN predictions are relevant to the query topics?
• Feature weighting: which features contribute the most?
• Temporal smoothing: enforce temporal coherence, remove outliers
• Post-filtering: refine the search using location (GPS) and time

Example topics: “Castle @ Night”, “Working in a coffee shop”, “Gardening in my home”

del Molino, et al., 2017. VC-I2R at ImageCLEF 2017: Ensemble of deep learned features for lifelog video summarization. CLEF Working Notes, CEUR.
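The feature-weighting step can be sketched as a weighted late fusion of per-feature relevance scores. A minimal illustration (the feature names, score values, and weights below are hypothetical, not the learned ones):

```python
import numpy as np

def fuse_scores(feature_scores, weights):
    """Combine per-feature relevance scores for each image with a weighted
    sum. feature_scores: dict feature_name -> (N,) array of scores in [0, 1];
    weights: dict feature_name -> scalar weight."""
    names = sorted(feature_scores)
    stacked = np.stack([feature_scores[n] for n in names])  # (F, N)
    w = np.array([weights[n] for n in names])                # (F,)
    return w @ stacked                                       # (N,) fused scores

# Hypothetical scores for 4 lifelog images from three feature streams.
scores = {
    "imagenet": np.array([0.9, 0.1, 0.4, 0.7]),
    "places":   np.array([0.2, 0.8, 0.5, 0.6]),
    "mscoco":   np.array([0.0, 0.3, 0.9, 0.1]),
}
weights = {"imagenet": 2.0, "places": 1.0, "mscoco": 0.5}
fused = fuse_scores(scores, weights)
ranking = np.argsort(-fused)  # images ranked by fused relevance
```

In the actual system the weights are learned per topic on the development set (see Fig. 3) rather than fixed by hand.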
1. Getting the Basic Semantics
• CNN classifiers
  – Object: ResNet-152 – ImageNet 1K
  – Place: ResNet-152 – Places365
• CNN detector
  – Faster R-CNN – MS COCO (80 classes)
• NTCIR-13 classifier
  – VGG-16 – ImageNet 1K
  – Replace the last layer (1K neurons) with 634 neurons
  – Sigmoid as the activation function
• Human detection and counting
  – Sighthound (https://www.sighthound.com)
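The NTCIR-13 classifier head can be sketched without a specific deep learning framework: the 1K-neuron ImageNet layer is replaced by a 634-neuron layer with sigmoid activations for multi-label concept prediction. A minimal NumPy sketch (the 634 concepts are from the slide; the 4096 feature dimension matches VGG-16's penultimate layer, and the random weights stand in for trained ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

NUM_CONCEPTS = 634  # number of NTCIR-13 concepts (from the slide)

rng = np.random.default_rng(0)
features = rng.standard_normal((2, 4096))  # VGG-16 penultimate-layer activations
# New head replacing the original 1000-way ImageNet classification layer.
new_head = rng.standard_normal((4096, NUM_CONCEPTS)) * 0.01

# Multi-label concept scores: a sigmoid per concept instead of a softmax,
# so each of the 634 concepts is predicted independently.
probs = sigmoid(features @ new_head)
```

Sigmoid rather than softmax matters here: a lifelog image can exhibit many concepts at once, so the outputs must not compete for probability mass.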
2. Aggregating & Weighing Features

[Fig. 3: five panels (tasks 1, 2, 5, 6, and 10; X = 400) plotting P, R, and F1 on a 0–1 axis for different candidate parameter settings.]

Fig. 3: Weight learning from the development set. Description of the parameters: q = quality threshold; w = [w_coco, w_objy, w_objn, w_ply, w_pln, w_loc, w_act, w_ppl]; win = size of the smoothing window. (Best viewed in color.)
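The weight learning in Fig. 3 amounts to a grid search over q, w, and win on the development set, keeping the configuration with the best F1. A hedged sketch (the `evaluate` stand-in below is hypothetical; the real system would run retrieval on the development set and measure precision/recall against its annotations):

```python
import itertools
import numpy as np

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def evaluate(weights, q, win):
    """Placeholder for running retrieval with the given feature weights,
    quality threshold q, and smoothing window win, and returning
    (precision, recall). This toy stand-in peaks at weights == 1."""
    p = 1.0 / (1.0 + np.sum((np.asarray(weights) - 1.0) ** 2))
    return p, p

# Coarse grid over a few candidate settings, in the spirit of Fig. 3.
best = max(
    ((q, w, win) for q in (0, 35, 50)
                 for w in itertools.product((0.0, 1.0, 2.0), repeat=2)
                 for win in (1, 3)),
    key=lambda cfg: f1(*evaluate(cfg[1], cfg[0], cfg[2])),
)
```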
Task | Objects: relevant | Objects: avoid | Places: relevant | Places: avoid | MSCOCO: relevant
1  | computer, group meeting | – | computer, group meeting, etc. | – | laptop, keyboard
2  | television, food, glass | computer, group meeting | living room, television room, etc. | conference room, lecture room, etc. | tv, remote, etc.
3  | computer | group meeting | office, coffee shop, living room, etc. | conference room, office, etc. | laptop, keyboard
4  | computer, pencil, notebook | – | office, living room, hotel room, etc. | conference room, office, etc. | laptop, book, etc.
5  | food, glass | drum, white goods, menu | food court, restaurant, etc. | – | fork, sandwich, etc.
6  | drink, glass, beverage | computer | bar, pub, etc. | home | bottle, wine glass
7  | – | public transport | temple, palace, etc. | residential neigh., bus, etc. | –
8  | public transport | cab, car seat, taxi | bus interior, subway station, etc. | car interior | bus, train
9  | food, cooking utensil, white goods | – | pantry, kitchen, etc. | living room | oven, refrigerator, etc.
10 | shopping, shop | – | supermarket, store, shopfront, shopping mall, etc. | – | –

Table 1: Semantic queries for the retrieval task: concepts to search in WordNet to find all related (and to avoid) ImageNet classes, manual selection of relevant (and to avoid) Places classes, and objects to detect.
CRF for feature weighing that accommodates individual differences

Relevance mapping for each topic
In 40% of the tested cases, the users consider the summary obtained with AVS better than any other summary, including the summaries annotated with manual tools by other users.
2 General Overview of Active Video Summarization

The aim of AVS is to provide a customized summary with as little effort as possible from the user side. The system first asks for the user's initial preferences, selected from a set of items, i.e. the most frequent items in the original video. Then, the user's preferences are further refined through a question-asking inference.

AVS asks the user specific questions about segments of the video. It shows one selected segment, and asks the following two binary questions: Would you want this segment to be in the final summary?, and Would you want to include similar segments? Additionally, the user can decide at any time to go through the segments in the summary, and give such feedback about them. Although AVS is not limited to these two questions, experiments show that they are effective in practice, and they serve us as a proof of concept. Note that the original video is not shown to the user, as the summary and the segments shown during the interaction provide an accurate idea of the content of the video in much less time.

Thus, AVS can be divided into two inference problems: (i) infer the customized summary, and (ii) infer the next question to ask. We use a probabilistic approach based on Conditional Random Fields (CRFs) to infer the most likely summary, and to estimate the next question to ask. CRFs are sound probabilistic models that have been successfully applied to many computer vision and multimedia problems (Lafferty, McCallum, and Pereira 2001). In the following sections, we introduce CRFs to infer the customized summary, and then the algorithm that infers the questions to ask. They are summarized in Alg. 1.
3 Inference of the Customized Summary

Let s = {s_i} be the set of random variables that represent the summary of the video by indicating whether a segment (or subshot) of the video appears in the summary or not. Thus, s_i ∈ {0, 1}, where s_i is equal to 1 when the segment is included in the summary, and 0 otherwise. We denote P(s|θ) as the probability density distribution of how likely the summary s is preferred by the user. We model this distribution with a CRF, and θ are the values for the potentials of the CRF, which depend on the input video and the user's preferences.

A CRF models the probability density with a Gibbs distribution, c.f. (V. and Wainwright 2005). Therefore, P(s|θ) can be written as the normalized exponential of an energy function, which is denoted as E_θ(s). The energy function is the sum of a set of potentials, which are functions that take as input a subset of {s_i}. The summary of the video, which is denoted as s*_θ, is obtained by inferring the Maximum a Posteriori (MAP), i.e. s*_θ = argmax_s P(s|θ), or equivalently, maximizing the energy function E_θ(s).

In the following, we first introduce the potentials of the CRF, and then the algorithm to obtain the MAP summary.
3.1 CRF for Customized Summarization

We follow most methods in the literature, which select representative and diverse segments with as little motion as possible. To do so, we define the energy function of the CRF as

    E_θ(s) = λ Σ_i φ_u(s_i) + Σ_{ij} φ_p(s_i, s_j),    (1)

where the first sum collects the unary potentials φ_u and the second the pairwise potentials φ_p. The unary potentials enforce the selection of static segments, the pairwise potentials encourage segments with diverse semantic content, and λ is a parameter that weights the unary potentials with respect to the pairwise ones. There is a unary potential for each segment of the video, and one pairwise potential for each pair of similar segments. The length of the summary is controlled during the inference of the MAP summary by adding additional constraints to the energy function, as we show below.
Next, we introduce the potentials, with emphasis on how they are updated when new user preferences become known. Note that we omit the dependency of the potentials on θ for simplicity, and the parameters introduced in the following should be considered part of θ. The values of the parameters of the potentials are given in the implementation details in sec. 5.2.

Unary Potentials. The unary potentials, {φ_u(s_i)}, encourage selecting segments that the user will probably like. φ_u(s_i) is equal to Q_i I[s_i = 1] + L I[s_i = 0], in which: I[a] is an indicator function that is 1 if a is true and 0 otherwise; Q_i is a function representing how well that segment individually relates to the requirements; and L is a constant offset that is set during the MAP inference of the summary in order to adjust the summary length (sec. 3.2).

During the on-line interaction phase, when the user recommends including a segment s_i, Q_i is increased by a fixed amount to enforce the selection of that segment; otherwise, Q_i is decreased by the same amount.

Pairwise Potentials. The pairwise potentials, {φ_p(s_i, s_j)}, are defined between each pair of similar segments, and enforce selecting segments with diverse content.
Let d(x_i, x_j) be the Euclidean distance between the descriptors of two segments (details in sec. 5.2). The pairwise potential enforces that similar segments should not both be included in the summary. To do so, we define a potential that is weighted by the distance between descriptors, i.e. φ_p(s_i, s_j) = exp(−d(x_i, x_j)) φ′_p(s_i, s_j), in which φ′_p(s_i, s_j) enforces that both segments should not be selected at the same time, and the term exp(−d(x_i, x_j)) reduces the effect of φ′_p(s_i, s_j) when the segments are dissimilar. In this way, only one representative segment among similar segments is selected.

Specifically, φ′_p(s_i, s_j) is defined as

    φ′_p(s_i, s_j) = { Lα   if s_i = s_j = 0
                     { −Lβ  if s_i = s_j = 1    (2)
                     { γ    if s_i ≠ s_j

where γ is the cost of selecting only one segment in the pair, α and β are the costs to discard or select both segments, respectively, and L is a variable parameter that controls the
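The distance-weighted pairwise potential of Eq. (2) can be sketched directly (a hedged illustration; the descriptor values and parameter settings below are hypothetical, and the energy is maximized, so a more negative value means a worse configuration):

```python
import math

def pairwise_potential(si, sj, di, dj, L, alpha, beta, gamma):
    """Distance-weighted pairwise potential, following the form of Eq. (2):
    phi_p(si, sj) = exp(-d) * phi'_p(si, sj), where d is the Euclidean
    distance between the two segments' descriptors di and dj."""
    d = math.dist(di, dj)
    if si == 0 and sj == 0:
        base = L * alpha    # both segments discarded
    elif si == 1 and sj == 1:
        base = -L * beta    # both selected: penalizes redundancy
    else:
        base = gamma        # exactly one of the pair is selected
    return math.exp(-d) * base

# Selecting two near-identical segments is penalized strongly; for
# dissimilar segments the redundancy penalty fades with descriptor distance.
near = pairwise_potential(1, 1, [0.0, 0.0], [0.1, 0.0], L=1.0, alpha=0.5, beta=2.0, gamma=1.0)
far  = pairwise_potential(1, 1, [0.0, 0.0], [5.0, 0.0], L=1.0, alpha=0.5, beta=2.0, gamma=1.0)
```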
[Framework diagram: feature streams (ImageNet 1K, Places365, MSCOCO, NTCIR, time, #people, location tag), trained on the training images, are mapped to relevant concepts and combined with feature weights w1–w12.]
3. Temporal Smoothing
• Adjacent lifelog images may share a similar event.
• Temporal smoothing is used to ensure semantic coherence.
• A triangular window of size w is used; w is adaptive to event topics.
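The triangular smoothing above might look like the following sketch (the window weights and score values are illustrative; the adaptive choice of w per topic is not modeled here):

```python
import numpy as np

def triangular_smooth(scores, w):
    """Smooth per-image relevance scores with a triangular window of odd
    size w, so adjacent lifelog images share evidence about their event."""
    assert w % 2 == 1 and w >= 1
    half = w // 2
    # Triangular weights, e.g. w=3 -> [1, 2, 1] normalized to sum to 1.
    kernel = np.concatenate([np.arange(1, half + 2), np.arange(half, 0, -1)]).astype(float)
    kernel /= kernel.sum()
    # mode="same" keeps one smoothed score per image; edges see a partial window.
    return np.convolve(scores, kernel, mode="same")

scores = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.9, 0.0])
smoothed = triangular_smooth(scores, w=3)
```

Isolated spikes get spread onto their temporal neighbors, which both suppresses outliers and promotes images adjacent to confidently relevant ones.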
4. Post-filtering
• Increase diversity of retrieved images (avoid retrieving images of the same event)
• Use time and location (GPS) to filter images
• Exclude images that are close in time and location.
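The post-filtering step can be sketched as a greedy diversity filter (a hedged sketch; the thresholds and the planar distance approximation for GPS coordinates are assumptions, not the system's actual values):

```python
import math

def post_filter(candidates, min_dt=600.0, min_dist=50.0):
    """Walk candidates in decreasing score order and drop any image that is
    close in both time (seconds) and location (meters, approximated here by
    Euclidean distance on projected coordinates) to an already-kept image."""
    kept = []
    for img in sorted(candidates, key=lambda c: -c["score"]):
        too_close = any(
            abs(img["t"] - k["t"]) < min_dt
            and math.dist(img["xy"], k["xy"]) < min_dist
            for k in kept
        )
        if not too_close:
            kept.append(img)
    return kept

cands = [
    {"score": 0.9, "t": 0.0,    "xy": (0.0, 0.0)},    # best image of an event
    {"score": 0.8, "t": 30.0,   "xy": (5.0, 0.0)},    # same time and place: dropped
    {"score": 0.7, "t": 5000.0, "xy": (500.0, 0.0)},  # different event: kept
]
kept = post_filter(cands)
```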
Result
• Official score (precision): 57.6%
[Bar chart: per-topic mAP (0 to 1) for User 1 and User 2 over the event topics: Eat Lunch, Gardening, Castle at Night, Coffee, Sunset, Graveyard, Lecturing, Shopping, Working Late, On Computer, Cooking, Flying, Juice, Photo of Sea, Beers in Bar, Greek Amphit., TV Recording, Work w/ Coffee, Painting Walls, Eating Pasta, Exercises, Mountain Hiking, Turtles.]

Figure 3: Event-level retrieval results.
8. REFERENCES
[1] A. Dehghan, E. G. Ortiz, G. Shu, and S. Z. Masood. DAGER: Deep age, gender and emotion recognition using convolutional neural network. arXiv preprint arXiv:1702.04280, 2017.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[3] A. Garcia del Molino, M. Bappaditya, J. Lin, J.-H. Lim, S. Vigneshwaran, and C. Vijay. VC-I2R at ImageCLEF2017: Ensemble of deep learned features for lifelog video summarization. In CLEF Working Notes, CEUR, 2017.
[4] A. Garcia del Molino, Q. Xu, and J.-H. Lim. Describing lifelogs with convolutional neural networks: A comparative study. In Proceedings of the First Workshop on Lifelogging Tools and Applications, pages 39–44. ACM, 2016.
[5] C. Gurrin, H. Joho, F. Hopfgartner, L. Zhou, D.-T. Dang-Nguyen, R. Gupta, and R. Albatal. Overview of NTCIR-13 Lifelog-2 task. In Proceedings of NTCIR-13, Tokyo, Japan, 2017.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[8] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[9] S. Z. Masood, G. Shu, A. Dehghan, and E. G. Ortiz. License plate detection and recognition using deeply learned convolutional neural networks. arXiv preprint arXiv:1703.07330, 2017.
[10] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
[11] G. Roig, X. Boix, R. De Nijs, S. Ramos, K. Kuhnlenz, and L. Van Gool. Active MAP inference in CRFs for efficient semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2312–2319, 2013.
[12] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
[13] Q. Xu, S. Vigneshwaran, A. G. del Molino, J. Lin, F. Fang, J.-H. Lim, L. Li, and V. Chandrasekhar. Visualizing personal lifelog data for deeper insights at the NTCIR-13 Lifelog-2 task. In Proceedings of NTCIR-13, Tokyo, Japan, 2017.
[14] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva. Places: An image database for deep scene understanding. arXiv:1610.02055, 2016.
Analysis (Fine-tuning)

[Bar chart of mAP for User 1 and User 2 under Fixed, Adaptive (User), and Adaptive (User + Event) thresholds; values as in Table 2.]

Table 2: Effect of thresholds for relevant concept searching.
                        User 1   User 2
Fixed                   0.502    0.748
Adaptive (User)         0.528    0.761
Adaptive (User+Event)   0.654    0.826

Table 3: Effect of temporal smoothing.
Temporal Smoothing?     User 1   User 2
No                      0.528    0.761
Yes                     0.543    0.789
4.2 Official Results
The official evaluation measures the number of events detected in a given day (compared to the ground truth) as well as the accuracy of the event-detection process (given a sliding five-minute window). The metrics used are precision and recall, and the official score is the mean of precisions over the topics. Due to the complexity and difficulty of the queries, topics 16, 20, 23, and 24 were discarded, and only the remaining 20 topics were evaluated. The official score reported for our team is 57.6%, which ranked first.
4.3 Analysis
Besides the official results, we also use mean Average Precision (mAP) as our own evaluation metric to study the effect of key components of the proposed framework. The results are reported on the sampled test set with ground truth annotated by our team.
Effect of Thresholds.
We explore the effect of thresholds for relevant concept searching. Two configurations for the thresholds are tested: (1) the thresholds are adaptive to each user, and (2) the thresholds are adaptive to both user and event, which is more advanced than the first configuration. As shown in Table 2, both configurations outperform fixed thresholds. Moreover, the advanced configuration improves on the first one by a large margin.

Temporal Smoothing.
Table 3 studies the effect of temporal smoothing on the system, with the thresholds for relevant concept searching fixed. Note that there are consistent improvements for both users.
Feature Importances.
Figure 2 compares how important different features are for the retrieval task. "All" denotes that all features are used, while "- NTCIR-13" means the NTCIR-13 classifier feature is removed from "All" in the retrieval system, and likewise for the other configurations. A lower score for a configuration means removing that feature causes a larger performance drop, indicating the respective feature is important for retrieval. We observe that "NTCIR-13" is the most important feature to the system, followed by time, MSCOCO, and location among all the CNN-based features.
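This leave-one-out ablation can be sketched generically (the feature names match the paper, but the evaluator and its per-feature mAP contributions below are hypothetical stand-ins for the real retrieval run):

```python
def ablation_importance(evaluate, features):
    """Leave-one-out feature importance: the mAP drop when each feature is
    removed from the full set. `evaluate` maps a feature set to mAP;
    a larger drop means a more important feature."""
    full = evaluate(frozenset(features))
    return {f: full - evaluate(frozenset(features) - {f}) for f in features}

# Hypothetical evaluator: each feature contributes a fixed amount of mAP.
contrib = {"ntcir13": 0.20, "time": 0.10, "mscoco": 0.06, "location": 0.04}

def evaluate(feats):
    return 0.4 + sum(contrib[f] for f in feats)

drops = ablation_importance(evaluate, contrib)
ranked = sorted(drops, key=drops.get, reverse=True)  # most important first
```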
Event-level Results.
Figure 3 shows retrieval mAP for all event topics, using
[Bar chart: mAP (0.4 to 0.9) for User 1 and User 2 under configurations All, − NTCIR-13, − ImageNet1K, − Places365, − MSCOCO, − Location, − Time, − #People.]

Figure 2: Comparison of feature importances to the retrieval system.
our best model. One can see that for User 1, our system performs worse on topics like "Gardening", "Grocery Shopping", and "Painting Walls".
4.4 Application in the LIT Task
The method proposed in this paper is used in [13] for annotating activities in the NTCIR-13 Lifelog-2 Lifelog Insight Task (LIT). Ten activities are defined for LIT, namely: eating, walking, running, hiking, gym/yoga, socializing, taking a bus, driving a car/taking a taxi, taking a train, and being on a flight. Similar to the LSAT topics, the LIT activities have varying levels of abstraction, and the number of incidences ranges from a few to thousands. Our algorithm achieves a similar level of precision and recall in the LIT activity retrieval. The result has been effectively used for insight generation.
5. CONCLUSIONS
This paper focuses on the problem of event-driven lifelog image retrieval. We presented a general deep learning based framework to address a major challenge of the task: bridging the gap between visual images and high-level event concepts. We submitted the generated retrieval results to the NTCIR-13 Lifelog-2 Lifelog Semantic Access Task. Promising results have been officially reported, demonstrating the effectiveness of the proposed retrieval system.
6. ACKNOWLEDGEMENTS
The work is funded by the Singapore A*STAR JCO VIP REVIVE Project (1335h0009).
7. ADDITIONAL AUTHORS
Liyuan Li (Institute for Infocomm Research, A*STAR, Singapore. email: [email protected]) and Vijay Chandrasekhar (Institute for Infocomm Research, A*STAR, Singapore; Nanyang Technological University, Singapore. email: [email protected]).
Feature importance
Decrease in performance when we remove one type of feature: the bigger the decrease, the more important the feature. (Per-user results shown for User 1 and User 2.)

Effect of temporal smoothing
Whether temporal smoothing is performed or not.

Effect of threshold for relevant concept searching
Semantic concepts whose activation level is above the threshold are considered relevant to the query topic.

[Bar chart of mAP without smoothing (User 1: 0.528, User 2: 0.761) vs. with temporal smoothing (User 1: 0.543, User 2: 0.789).]
LIT

Summary
Effective lifelog image retrieval relies on:
• High-quality data
• Good semantic features
• Reasonable ground truth
• Intelligence in interpretation of query topics
• Intelligence in model fine-tuning

• A lot of fine-tuning and manual intervention are involved in the retrieval → over-fitting?
• "Relevant" concepts may not be contributing, and vice versa.
• Interactive retrieval is probably a good intermediate solution.

Email: [email protected]